Sunday, January 16, 2005

Creating Multilingual Web Documents with Apache HTTPD

I host a French-language site on my Web server devoted to the history of the village where I live in Switzerland. We intend to make an English translation of this available for anglophone emigrants researching the origin of their ancestors, so I've been looking into present-day, less painful alternatives to the traditional approach of creating copies of everything (except the images, sound files, etc.) in different directories or with file names based on the language, then fixing bazillions of links in zillions of files accordingly.

It turns out there's a feature in the Apache HTTPD which, in conjunction with the "content negotiation" facility in HTTP 1.1, provides a reasonably painless solution to migrating a single-language site to a multilingual one in an incremental fashion without disrupting everything already in place.

In order to use this, your Apache server must support the mod_negotiation module, but as this is enabled by default, most installations already include it. The directory tree which includes the multilingual documents must be configured with the "MultiViews" option. This option is not implied by "All"--you must explicitly set it in your HTTPD configuration, or with an "Options +MultiViews" statement in a .htaccess file in the directory with the multilingual pages or a parent directory above it. (The "+" adds MultiViews to the options already in effect; a bare "Options MultiViews" replaces them, turning everything else off.) You can only specify this option in a .htaccess file if the "AllowOverride" directive in httpd.conf permits overriding options. If you don't have control of your HTTP server's configuration, you'll have to take this information to the ogre who administers your server and beg on bended knee that s/he/it grant you this boon.

With the enabling .htaccess file in the document directory, we can now begin to transform the single-language documents into a multilingual ensemble. Let's start with the main page, "index.html", which is displayed when a user enters the directory. Any page with a simple extension of ".html" bypasses language negotiation, so the first step is to rename this page as "index.html.html". Huh? Well, it's an idiom--a page so named is deemed the default for users whose language preference isn't satisfied by any language-specific page in the directory. You should choose the language of the majority of visitors to your site for this page. That way, a French-speaking user who's checking your predominantly French-language site for updates from an Internet café in Pakistan where the language preference has been set to Urdu (ur-pk) will see your best guess as to the most probable language. The default page should contain links, right at the top or otherwise obvious when the page appears, to each of the translations available. Use flag icons and translations of the language name (for example, "Deutsch", not "German"; "עברית", not "Hebrew") for these links.

Next, create pages for each natural language translation of pages in the directory. For index.html, we might have:

index.html.de     German
index.html.en     English
index.html.fr     French
index.html.it     Italian
index.html.rm     Rhæto-Roman

Now, when a user tries to access the document index.html, which doesn't exist in the directory, the Apache server looks at the user's language preferences and returns the first available page which matches the highest priority language preference or, should none match, the default page, index.html.html.
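
To make the lookup just described concrete, here is a minimal sketch in Python of how such a selection might work. It is a simplified model, not Apache's actual algorithm (which weighs other factors as well); the function names are my own invention, and it assumes exact matching of language codes against file name suffixes.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (language, quality) pairs, best first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a language listed without "q=" has quality 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs.append((tag.strip().lower(), quality, position))
    # Highest quality first; ties keep the order in which the browser listed them.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(tag, quality) for tag, quality, _ in prefs if quality > 0]


def choose_variant(base, available, accept_language):
    """Return the file name a MultiViews-like server might serve for `base`."""
    for tag, _ in parse_accept_language(accept_language):
        candidate = f"{base}.{tag}"
        if candidate in available:
            return candidate
    return f"{base}.html"  # nothing matched: fall back to the default page


files = {"index.html.html", "index.html.de", "index.html.en",
         "index.html.fr", "index.html.it", "index.html.rm"}
print(choose_variant("index.html", files, "fr-ch, fr;q=0.9, en;q=0.5"))  # index.html.fr
print(choose_variant("index.html", files, "ur-pk"))                      # index.html.html
```

The second call is the Urdu-speaking Internet café from the example above: no translation matches, so the default page is chosen.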

If a user has set up their language preferences in the browser, the result is "indistinguishable from magic"--the proper language appears for each page viewed, and links within those pages have no need to be language-specific, since when an explicit lookup of the URL fails, the language preference is used to obtain the best page for the user. Thus, a page which links to "archives/postcards.html" will, if no postcards.html file exists in directory archives, choose the best language match or postcards.html.html as the default.

I've set up a toy directory to play with this. Visit this directory, and you should see your language of choice, with English, French, German, and Italian included, plus a "Multilingual Default" page for those with language preferences which include none of the available languages. Each page includes explicit links to those in the other languages.
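
If you'd rather not fiddle with your browser's preferences to test a directory like this, you can send the language preference by hand. The following is a sketch using Python's standard urllib; the URL in the comment is hypothetical, and it assumes the server labels its reply with a Content-Language header, which depends on how the server is configured.

```python
import urllib.request


def build_request(url, language):
    """Build a GET request that announces `language` as the only preference."""
    return urllib.request.Request(url, headers={"Accept-Language": language})


def fetch_content_language(url, language):
    """Fetch `url` and report the Content-Language header of the reply, if any."""
    with urllib.request.urlopen(build_request(url, language), timeout=10) as reply:
        return reply.headers.get("Content-Language")


def report(url, languages):
    """Print which language the server returns for each requested preference."""
    for language in languages:
        print(language, "->", fetch_content_language(url, language))


# Example (hypothetical URL; substitute a negotiated page on your own server):
# report("http://www.example.com/multilingual/index.html",
#        ["fr", "de", "en-us", "ur-pk"])
```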

There are, as always, some significant gotchas here. First of all, the visitor to your site may be using a browser which is clueless when it comes to content negotiation. When following a link to a directory in which multiple language versions of "index.html.*" files are present, they'll see only the directory listing and have to work out that they should click on the file with their preferred language code. Worse, on explicit links to multilingual documents, it's the Dreaded 404 they'll see. It's up to you to weigh the benefits of nearly painless migration to a multilingual site against brickbats from users of antiquated browsers.

Next, you have to worry about excess specificity in language preference on the part of your users. It seems that some browsers downloaded by users in the U.S. come with a default language preference of "en-us", the dialect of English spoken by denizens of that polity. You'd expect that if a Web page specified its language as "en" and the user requested a more specific version of that language, the general version would be delivered. But you'd be wrong--language matching, at least as far as I can determine based on my experiments, is a pure string comparison, so you might well consider making symbolic links to your ".en" documents from ".en-us" and so on.
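
Making those symbolic links by hand gets tedious in a directory with many pages, so here is a sketch of a Python helper which does it. The function name and the alias table are my own assumptions, it presumes a Unix-like system, and your server must be configured to follow symbolic links (the "FollowSymLinks" or "SymLinksIfOwnerMatch" option) for the links to be served.

```python
import os


def link_regional_aliases(directory, aliases):
    """Create e.g. index.html.en-us -> index.html.en for each existing variant.

    `aliases` maps a general language code to the regional codes that should
    point at it, for example {"en": ["en-us", "en-gb"]}.
    """
    created = []
    for name in sorted(os.listdir(directory)):
        stem, dot, code = name.rpartition(".")
        if not dot or code not in aliases:
            continue  # not a file with one of the language suffixes we alias
        for regional in aliases[code]:
            link_path = os.path.join(directory, f"{stem}.{regional}")
            if os.path.lexists(link_path):
                continue  # never overwrite an existing file or link
            os.symlink(name, link_path)  # relative link within the directory
            created.append(f"{stem}.{regional}")
    return created


# Example (hypothetical path):
# link_regional_aliases("/var/www/village", {"en": ["en-us", "en-gb"]})
```

Running it a second time is harmless, since existing files and links are left alone.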

And finally, be sure your ".html.html" default page includes prominent, localised links to each of its language-specific peers, and that your language-specific pages have a means to switch to the other available languages. One hopes that a browser downloaded by a user who requests a given language for its user interface will request documents in that language as its first choice, but you never know.

What pops out of a search engine is knowable only to the pigeons savants at Google and their painstakingly pecking peers at the competition, so to avoid inadvertent «faux amis» hits in search engines, don't forget to specify the principal language for each of your pages, ideally in the <html> tag, for example, <html lang="la"> for a document in Latin.
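
With a pile of translated pages it's easy to forget the language declaration on one of them. Here is a sketch of a checker using Python's standard html.parser; the class and function names are hypothetical, and it looks only at the "lang" attribute of the first <html> tag.

```python
from html.parser import HTMLParser


class HtmlLangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag encountered."""

    def __init__(self):
        super().__init__()
        self.seen_html = False
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # The parser hands us tag and attribute names in lower case.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def declared_language(markup):
    """Return the language declared on <html>, or None if there is none."""
    finder = HtmlLangFinder()
    finder.feed(markup)
    return finder.lang


print(declared_language('<html lang="la"><body>Salve</body></html>'))  # la
print(declared_language('<html><body>Hello</body></html>'))            # None
```

One could run this over each "index.html.xx" file and compare the result with the "xx" suffix to catch mislabelled pages.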

This is more or less a first draft of a tutorial on MultiViews I'll eventually publish on the main site. I'm posting it here to run it by the most critical readers and bleeding-edge early adopters (flattery, yes, but perfectly true) to invite corrections and brickbats before I encourage others to build multilingual sites this way.

Posted at January 16, 2005 01:30