Learning Domain-Specific, L1-Specific Measures of Word Readability

Shane Bergsma, David Yarowsky
Dept. of Computer Science and Human Language Technology Center of Excellence
Johns Hopkins University
Baltimore, Maryland
shane.a.bergsma@gmail.com, yarowsky@cs.jhu.edu
Improved readability ratings for second-language readers could have a huge impact in areas such as education, advertising, and information retrieval. We propose ways to adapt readability measures for users who (a) are proficient in a particular domain, and (b) have a particular native language (L1). Specifically, we predict the readability of individual words. Our learned models use a range of creative features based on diverse statistical, etymological, lexical, and morphological information. We evaluate on a corpus of computational linguistics articles divided according to seven L1s; we show that we can accurately predict the target readability scores in this domain. Our technique improves over several reasonable baselines. We provide an in-depth analysis showing which kinds of information are most predictive of word difficulty in different L1s, and show how this differs for style and content words.