Authorship Attribution and Optical Character Recognition Errors

Patrick Juola*,**, John I. Noecker Jr** and Michael V. Ryan**
*Evaluating Variations in Language Laboratory; Duquesne University; Pittsburgh; Pennsylvania; USA
**Juola & Associates; Pittsburgh; Pennsylvania; USA
Abstract (in English)
Stylometric authorship attribution is a fundamental problem. The underlying idea is that the authorship of a document can be determined from cognitive and linguistic quirks that uniquely identify a person. In many cases, however, noise in the original documents makes this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test how sensitive the authorship attribution process is to character misrecognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
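The simulated-error setup described above can be illustrated with a minimal sketch. The abstract does not specify the noise model, so the following assumes the simplest one: each letter is independently replaced, with a fixed probability, by a different letter chosen uniformly at random. The function name inject_ocr_noise and its parameters (error_rate, seed, alphabet) are hypothetical and are not taken from the paper.

```python
import random
import string


def inject_ocr_noise(text, error_rate, seed=None, alphabet=string.ascii_lowercase):
    """Simulate character misrecognition (assumed model, not the paper's).

    Each character whose lowercase form is in `alphabet` is replaced, with
    probability `error_rate`, by a different letter drawn uniformly from
    `alphabet`. Case is preserved; all other characters pass through unchanged.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)  # local generator so runs are reproducible
    noisy = []
    for ch in text:
        lower = ch.lower()
        if lower in alphabet and rng.random() < error_rate:
            # Exclude the original letter so every "error" really changes it.
            substitute = rng.choice([c for c in alphabet if c != lower])
            noisy.append(substitute.upper() if ch.isupper() else substitute)
        else:
            noisy.append(ch)
    return "".join(noisy)


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    for rate in (0.0, 0.05, 0.20):
        print(rate, inject_ocr_noise(sample, rate, seed=0))
```

Under this assumption, one could corrupt every document in a benchmark corpus at several error rates and rerun the same attribution method on each noisy copy to see how accuracy changes. Real OCR errors are not uniform (visually similar characters are confused more often), so a confusion-weighted substitution table would be a more realistic variant.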