Journal of Systems Integration, Vol 5, No 4 (2014)


Automatic Encoding and Language Detection in the GSDL

Otakar Pinkas

Abstract


Automatic detection of text encoding and language is part of the Greenstone Digital Library Software (GSDL), a suite for building and distributing digital collections. GSDL is developed by the University of Waikato (New Zealand) in cooperation with UNESCO. Automatic encoding and language detection for Slavic languages is difficult and sometimes fails. The aim of this paper is to identify the cases in which it fails. Automatic detection in the GSDL is based on the n-gram method. The most frequent n-grams for Czech are presented, and the whole process of automatic detection in the GSDL is described. The input documents for the test collections are plain texts encoded in ISO-8859-1, ISO-8859-2 and Windows-1250. We evaluated the quality of automatic detection manually. The causes of errors include the predominance of an improper language model and an incorrect switch to Windows-1250. We also carried out further tests on more complex documents; these are covered in a separate article.
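
To illustrate the n-gram method mentioned above, the following is a minimal Python sketch of rank-order n-gram detection over candidate encodings. It is not the GSDL implementation: the function names (ngram_profile, out_of_place, detect), the "out-of-place" rank distance, the profile size and the tiny training samples are all assumptions chosen for illustration. Real language models are trained on much larger corpora.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (length 1..n_max), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries with underscores
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    max_penalty = len(model_profile)
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def detect(raw, training, encodings):
    """Try every (encoding, language) pair and return (score, language, encoding)
    with the lowest distance, or None if no candidate encoding can decode raw."""
    models = {lang: ngram_profile(sample) for lang, sample in training.items()}
    best = None
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        doc = ngram_profile(text)
        for lang, model in models.items():
            score = out_of_place(doc, model)
            if best is None or score < best[0]:
                best = (score, lang, enc)
    return best


# Hypothetical toy training data; far too small for real use.
TRAINING = {
    "cs": "příliš žluťoučký kůň úpěl ďábelské ódy",
    "en": "the quick brown fox jumps over the lazy dog",
}
CANDIDATES = ["iso8859_1", "iso8859_2", "cp1250"]

if __name__ == "__main__":
    raw_bytes = "žluťoučký kůň".encode("cp1250")
    print(detect(raw_bytes, TRAINING, CANDIDATES))
```

The sketch also hints at why such detection can fail for Czech: ISO-8859-2 and Windows-1250 share the byte values of most accented letters (for example č, ý, ů, ň) and differ for only a few (such as š, ž, ť), so short texts may give nearly identical scores under both encodings.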


DOI: http://dx.doi.org/10.20470/jsi.v5i4.211

ISSN: 1804-2724

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 Czech Republic License.