Thursday, January 15, 2009

The Hacker's Diet Online: Unicode Character Disaster

The first major disaster since the recent server update has struck The Hacker's Diet Online: non-ASCII Unicode characters (for example, accented letters; Greek, Cyrillic, and other alphabets; and Chinese and Japanese characters) appearing in comment fields of log forms, and in other contexts such as account information, have been garbled. In addition, users whose user names and/or passwords contained such characters may have had problems logging in.

It turns out (based on a preliminary analysis, performed under the gun, which may be revised as I investigate further) that the Perl function decode_utf8 no longer behaves as it did. Under Perl 5.8.5 on the old server, it did precisely what its name implies: decode a string containing byte codes in UTF-8 encoding into a Unicode string. Now, on Perl 5.8.8, it has become “smart” (in the Microsoft sense of the word) and decides whether or not to do what you invoked it to do based upon whether the string argument it was passed was tagged as having been in UTF-8 encoding. When processing arguments to CGI scripts on Web sites, there's a bit of a problem in handling encoding: normally you want to decode UTF-8 arguments, but in the case of file uploads, the decoding must be suppressed lest binary files be corrupted. Further, arguments passed via the QUERY_STRING environment variable for GET requests and via standard input for POST requests require different handling of encoding. Stir “smartness” on the part of a Perl function into this mess and you're asking for a disaster, which was duly delivered.
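
To make the encoding problem concrete, here is a minimal sketch of the rule described above: text arguments are always decoded from UTF-8, and uploaded file content is never touched. It is written in Python rather than the Perl of the actual application, and it is not the patch applied to The Hacker's Diet Online. The function name decode_field and all sample values are made up for illustration.

```python
from typing import Union
from urllib.parse import unquote_to_bytes


def decode_field(raw: bytes, is_upload: bool) -> Union[str, bytes]:
    """Decode a text argument from UTF-8; return upload content unchanged."""
    if is_upload:
        return raw  # binary data must not be run through a text decoder
    return raw.decode("utf-8")  # explicit decode, independent of any hidden flag


# A text field as its UTF-8 bytes would arrive in a POST body (hypothetical value).
comment = "Café Ω 日本".encode("utf-8")
text = decode_field(comment, is_upload=False)

# The first bytes of a PNG file: not valid UTF-8, so decoding would fail.
png_header = b"\x89PNG\r\n\x1a\n"
blob = decode_field(png_header, is_upload=True)

# A GET argument arrives percent-encoded and is unescaped to bytes first.
query_value = decode_field(unquote_to_bytes("caf%C3%A9"), is_upload=False)

# What garbling looks like when UTF-8 bytes are taken as Latin-1 characters.
garbled = "é".encode("utf-8").decode("latin-1")
```

The point of the sketch is that the decision to decode rests on what kind of argument is being processed, not on a tag carried by the string. The last line shows the familiar symptom when the decoding step goes wrong: one accented letter turns into two unrelated characters.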

I have applied a one-line patch to The Hacker's Diet Online which, based upon my testing so far, appears to fix the problem. Since there are so many forms in the application which accept Unicode characters and a variety of types of input (text fields, passwords, uploaded files), it will take some time to verify that everything is now working correctly. I'm pretty confident that the clamant problem of corruption of log comment fields is now corrected. Users who have had their comment fields wrecked due to this problem and do not wish to retype the affected comments should contact me via the feedback form, and I'll try to restore any corrupted comments from the daily backup tapes of the server farm.

Posted at January 15, 2009 00:15