Cocytus: parallel NLP over disparate data

Noah Evans, Masayuki Asahara, Yuji Matsumoto
Nara Institute of Science and Technology
8916-5, Takayama-cho, Ikoma-shi
Nara 630-0192 JAPAN
 
As NLP deals with larger datasets and more computationally expensive algorithms, cutting-edge NLP research is increasingly becoming the province of companies such as Google, which can devote enormous resources to NLP tasks. Smaller institutions are being left behind. Beyond this lack of resources, the resources a typical researcher does have access to come in a variety of incompatible data formats and are subject to differing operating system semantics. As a result, NLP researchers spend a large amount of time developing tools to support these formats, time that could otherwise be spent on productive research. To address these problems of data representation and large-scale data processing, this paper presents Cocytus, a platform for creating NLP tools, loosely based on Unix, that handles different data formats and parallel computation transparently, allowing institutions to make maximum use of the resources they have.