Building a Corpus-based Historical Portuguese Dictionary: Challenges and Opportunities

Arnaldo Candido Junior, Sandra Maria Aluísio
 
Center of Computational Linguistics (NILC) / Department of Computer Sciences,
University of São Paulo, Av. Trabalhador São-Carlense, 400, 13560-970 - São
Carlos/SP, Brazil
arnaldoc@icmc.usp.br, sandra@icmc.usp.br

 

Historical corpora are important resources for many fields, including Philology, Human Language Technology, Literary Studies, History, and Lexicography. However, compiling historical corpora differs from compiling contemporary ones. Corpus designers have to deal with several characteristics inherent in historical texts, such as the absence of a spelling standard, pervasive use of abbreviations and their spelling variants, missing spaces between words, irregular hyphenation, and non-standard typographical symbols. This paper addresses the challenges posed in processing the corpus designed for the Historical Dictionary of Brazilian Portuguese (HDBP) project, which is composed of texts from the sixteenth through the beginning of the nineteenth century, and presents the solutions found to support the compilation of a historical Portuguese dictionary based on this corpus.