Learning Domain-Specific, L1-Specific Measures of Word Readability

Shane Bergsma^* and David Yarowsky^*

^*Dept. of Computer Science and Human Language Technology Center of Excellence; Johns Hopkins University; Baltimore, Maryland; USA; shane.a.bergsma@gmail.com, yarowsky@cs.jhu.edu

Abstract

Improved readability ratings for second-language readers could have a huge impact in areas such as education, advertising, and information retrieval. We propose ways to adapt readability measures for users who (a) are proficient in a particular domain, and (b) have a particular native language (L1). Specifically, we predict the readability of individual words. Our learned models use a range of creative features based on diverse statistical, etymological, lexical, and morphological information. We evaluate on a corpus of computational linguistics articles divided according to seven L1s; we show that we can accurately predict the target readability scores in this domain. Our technique improves over several reasonable baselines. We provide an in-depth analysis showing which kinds of information are most predictive of word difficulty in different L1s, and show how this differs for style and content words.

Published in

Varia

Document

TAL_54_1_7.pdf
