Journal of Systems Integration, Vol 6, No 4 (2015)


Automatic Encoding and Language Detection in the GSDL – Part II

Otakar Pinkas

Abstract


Processing of the older MS Word format in the GSDL depends on the correct encoding of the temporary HTML file. The “windows-scripting” approach fails, whereas the wvware.exe program succeeds. The current .docx format requires the user to change a setting in the Word configuration. The temporary HTML file should be encoded in UTF-8 instead of Windows-1250, which is the preset in the Czech environment. The automatic conversion from ISO-8859-2 to Windows-1250 for HTML pages is wrong, but the conversion from ISO-8859-1 to Windows-1252 is valid. Automatic language detection is sometimes incorrect because the model of a similar language predominates, and it needs further investigation.
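
The asymmetry between the two conversions can be illustrated with a short sketch. It is not the GSDL code; it only uses Python's standard codecs, and the helper name mismatched_bytes is hypothetical. It compares how each byte in the range 0xA0 to 0xFF decodes under the ISO encoding and under the corresponding Windows code page. The assumption is that an "automatic conversion" here means relabelling the bytes of a page as the Windows code page without transcoding them.

```python
def mismatched_bytes(iso_name, win_name):
    """Return (byte, iso_char, win_char) for bytes 0xA0-0xFF that decode differently."""
    diffs = []
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        iso_char = raw.decode(iso_name)
        win_char = raw.decode(win_name, errors="replace")
        if iso_char != win_char:
            diffs.append((value, iso_char, win_char))
    return diffs


# Central European pair: several letters sit at different byte positions.
latin2_diffs = mismatched_bytes("iso8859_2", "cp1250")

# Western European pair: the printable upper half is identical.
latin1_diffs = mismatched_bytes("iso8859_1", "cp1252")

# A Czech sample text relabelled as Windows-1250 without transcoding.
sample = "žluťoučký"
relabelled = sample.encode("iso8859_2").decode("cp1250")

# Transcoding through the declared charset to UTF-8 keeps the text intact.
utf8_bytes = sample.encode("iso8859_2").decode("iso8859_2").encode("utf-8")

if __name__ == "__main__":
    print(len(latin2_diffs), "differing bytes for ISO-8859-2 vs Windows-1250")
    print(len(latin1_diffs), "differing bytes for ISO-8859-1 vs Windows-1252")
    print(sample, "->", relabelled)
```

In this sketch the Western European comparison finds no differing bytes, while the Central European comparison finds several, including the byte 0xA9, which is Š in ISO-8859-2 but © in Windows-1250. Decoding with the charset the page actually declares and re-encoding to UTF-8 avoids the problem in both cases.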


DOI: http://dx.doi.org/10.20470/jsi.v6i4.238

ISSN: 1804-2724

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 Czech Republic License.