Identification of Cognates and Recurrent Sound Correspondences in Word Lists

Grzegorz Kondrak*
*Department of Computing Science; University of Alberta; Edmonton, AB T6G 2E8, Canada
Abstract
Identification of cognates and recurrent sound correspondences is a component of two principal tasks of historical linguistics: demonstrating the relatedness of languages, and reconstructing the histories of language families. We propose methods for detecting and quantifying three characteristics of cognates: recurrent sound correspondences, phonetic similarity, and semantic affinity. The ultimate goal is to identify cognates and correspondences directly from lists of words representing pairs of languages that are known to be related. The proposed solutions are language-independent and are evaluated against authentic linguistic data. The results of evaluation experiments involving the Indo-European, Algonquian, and Totonac language families indicate that our methods are more accurate than comparable programs and achieve high precision and recall on various test sets. The results also suggest that combining various types of evidence substantially increases cognate identification accuracy.