I figured it'd be fun to paste in some foreign-language text and see how my site handles it now that I do Unicode.
I got the following texts from this helpful Unicode test page. I really wanted to find the Shema somewhere where it wasn't shown as an image, but I wasn't able to.
По оживлённым берегам
Громады стройные теснятся
Дворцов и башен; корабли
Толпой со всех концов земли
К богатым пристаням стремятся;
Ἰοὺ ἰού· τὰ πάντʼ ἂν ἐξήκοι σαφῆ.
Ὦ φῶς, τελευταῖόν σε προσϐλέψαιμι νῦν,
ὅστις πέφασμαι φύς τʼ ἀφʼ ὧν οὐ χρῆν, ξὺν οἷς τʼ
οὐ χρῆν ὁμιλῶν, οὕς τέ μʼ οὐκ ἔδει κτανών.
पशुपतिरपि तान्यहानि कृच्छ्राद्
अगमयदद्रिसुतासमागमोत्कः ।
कमपरमवशं न विप्रकुर्युर्
विभुमपि तं यदमी स्पृशन्ति भावाः ॥
This is supposedly Chinese, but my browser doesn't have the fonts installed so I get a bunch of question marks:
子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?
人不知而不慍,不亦君子乎?」
有子曰:「其為人也孝弟,而好犯上者,鮮矣;
不好犯上,而好作亂者,未之有也。君子務本,本立而道生。
孝弟也者,其為仁之本與!」
ஸ்றீனிவாஸ ராமானுஜன் ஐயங்கார்
بِسْمِ ٱللّٰهِ ٱلرَّحْمـَبنِ ٱلرَّحِيمِ
ٱلْحَمْدُ لِلّٰهِ رَبِّ ٱلْعَالَمِينَ
ٱلرَّحْمـَبنِ ٱلرَّحِيمِ
مَـالِكِ يَوْمِ ٱلدِّينِ
إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ
ٱهْدِنَــــا ٱلصِّرَاطَ ٱلمُسْتَقِيمَ
صِرَاطَ ٱلَّذِينَ أَنعَمْتَ عَلَيهِمْ غَيرِ ٱلمَغضُوبِ عَلَيهِمْ وَلاَ ٱلضَّالِّين
Here's testing some punctuation:
The convention in English is “to use double quotation marks to indicate quotation, and ‘single quotation marks’ for nested quotations.”
En français la convention est « d'utiliser les guillemets français doubles pour les citations, et “ les guillemets anglais doubles ” ou bien ‹ les guillemets français simples › pour les citations imbriquées. »
Auf Deutsch ist die Vereinbarung »umgekehrte zweifache Anführungszeichen für die Zitate zu benutzen, sogar ›einfache Anführungszeichen‹ für die verschachtelte Zitate«; diese Anführungszeichen „dürfen auch solche ‚englische‘ Anführungszeichen sein.“
The en-dash is used between numbers such as in: 1685–1750 (J. S. Bach). It is longer than the hyphen (as in “en-dash”, or, more properly, “en‐dash”) but shorter than the em-dash, which is used — like this — as a sort of parenthesis. Neither should be confused with the horizontal bar which is used to introduce quotation in some cases.
― Like this?
― Right.
The ellipsis is… well, it just is.
It'll be interesting to see if my StructuredText parser dies on all of this.
Other than that, one concern I have is that someone could post invalid Unicode data to my site and I'd accept it as is. I wonder whether any environments other than PHP ensure that incoming text is actually in the proper encoding.
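A minimal sketch of one way to guard against that in PHP, assuming the mbstring extension is available. The helper name `is_valid_utf8` is made up for illustration and is not part of the site's code:

```php
<?php
// Hypothetical helper: returns true only if $text is well-formed UTF-8.
// Assumes the mbstring extension is loaded.
function is_valid_utf8($text) {
    return mb_check_encoding($text, 'UTF-8');
}

// "caf" followed by the two-byte UTF-8 sequence for e-acute is accepted;
// "caf" followed by a lone 0xE9 byte (Latin-1 e-acute) is rejected.
var_dump(is_valid_utf8("caf\xC3\xA9"));
var_dump(is_valid_utf8("caf\xE9"));
```

A form handler could call something like this on each submitted field and refuse (or re-encode) anything that fails the check before it reaches the database.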
Ok, here goes. Will I corrupt MySQL? Will I make PHP crash? Tune in next time, same bat time, same bat channel.
Ok then. Everything seems to have worked (but of course I still can't verify the Chinese), except that the Arabic is not right-justified. I wonder what I have to do to do that. The page I got this from had the following to begin the Arabic blockquote: <blockquote xml:lang="ar" lang="ar" dir="rtl">. Language and direction are two things I guess I should shoehorn into my StructuredText parser. Interestingly, that page's encoding was ASCII and they used all entities to include the other languages, so that page didn't actually use Unicode itself at all.
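As a rough idea of what emitting language and direction could look like, here is a small sketch. The function `lang_blockquote` and its parameters are hypothetical, not part of any StructuredText parser. It assumes the caller knows whether the page is being served as XML:

```php
<?php
// Hypothetical helper: wraps text in a blockquote that carries language
// and direction attributes, like the Arabic blockquote quoted above.
function lang_blockquote($text, $lang, $dir = 'ltr', $as_xml = true) {
    $safe_lang = htmlspecialchars($lang, ENT_QUOTES, 'UTF-8');
    $attrs = 'xml:lang="' . $safe_lang . '"';
    if (!$as_xml) {
        // User agents parsing the page as text/html look at plain lang.
        $attrs .= ' lang="' . $safe_lang . '"';
    }
    // Only two directions are valid, so anything else falls back to ltr.
    $attrs .= ' dir="' . ($dir === 'rtl' ? 'rtl' : 'ltr') . '"';
    return "<blockquote $attrs>"
        . htmlspecialchars($text, ENT_QUOTES, 'UTF-8')
        . '</blockquote>';
}

echo lang_blockquote('...', 'ar', 'rtl'), "\n";
```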
I just finished everything I said I'd do a few days ago.
- All of my old entries that had been written prior to me using StructuredText on my blog have been converted from HTML to StructuredText. Much of it I was able to automate, but parts of it I had to do manually. I converted over 1700 posts. I still can't be completely sure everything is correct. Also, there was the occasional HTML that couldn't be equivalently converted into StructuredText, so I had to leave those cases alone, escaping out to HTML in my markup. So, I at least converted those instances to XHTML, but those bits of XHTML may come back to bite me the next time web standards change.
- All of my weblog entries (where applicable) have been converted from pre-Unicode encodings (ISO-8859-1, but usually Windows-1252) to UTF-8.
- Same goes for all of my weblog comments.
- My pages are served as XHTML 1.1 with a content type of application/xhtml+xml for user-agents that support it, falling back to text/xml and then text/html for those that don't.
- The appropriate language tags are output (xml:lang or just lang) based on what content-type is served.
Now for some code. First, here's how I change what content-type I send depending on the user-agent's accept header (reformatted slightly):
<?php
$accept = isset($_SERVER['HTTP_ACCEPT']) ?
    array_map('trim', explode(',', $_SERVER['HTTP_ACCEPT'])) :
    array();
$served_as_xml = true;
if(in_array('application/xhtml+xml', $accept)){
    $content_type = 'application/xhtml+xml';
}elseif(in_array('text/xml', $accept)){
    $content_type = 'text/xml';
}else{
    $content_type = 'text/html';
    $served_as_xml = false;
}
header("Content-type: $content_type; charset=utf-8");
?>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"<?php if(!$served_as_xml){?> lang="en"<?php } ?>>
<head>
<meta http-equiv="Content-Type" content="<?php echo $content_type;?>; charset=utf-8" />
...etc. Note, also, if you do this yourself, that XML requires that the XML prolog be at the very beginning of the document. If you have any blank lines or spaces in front of it, parsing will fail. Also, supposedly some browsers have problems if you include the XML prolog. I tested Mozilla, IE6, and Opera (7.5) and didn't notice any problems, but if you do please let me know.
That was the easy part. Unicode conversion was much more of a pain. PHP's built-in Unicode support is almost non-existent. There is a utf8_encode function, but it assumes its input is ISO-8859-1, so it mishandles the characters that Windows-1252 puts in the 0x80-0x9F range (curly quotes, dashes, and so on), and I had to do something else. Thankfully, it turns out my host had PHP's mb extension installed, and I was able to use mb_convert_encoding to handle Windows-1252.
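To make the difference concrete, here is a small sketch, assuming the mbstring extension is loaded. It converts the same byte twice, once as ISO-8859-1 (the assumption utf8_encode makes) and once as Windows-1252:

```php
<?php
// Byte 0x93 is a left curly quote in Windows-1252 but a C1 control
// character in ISO-8859-1.
$byte = "\x93";

$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($as_latin1), "\n"; // c293   (U+0093, an invisible control character)
echo bin2hex($as_cp1252), "\n"; // e2809c (U+201C, the intended curly quote)
```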
Also, one of the comments on the manual page for utf8_encode listed a great little function called "seems_utf8" that was a life saver. Because some of my character data was inconsistent (some was already in UTF-8), I couldn't just blindly convert from Windows-1252 to UTF-8: converting text that is already UTF-8 mangles it, since to-utf8(to-utf8(text)) != to-utf8(text). That function does a great job of figuring out whether character data is already in UTF-8, so I was able to convert safely.
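The double-conversion problem is easy to reproduce. A minimal sketch, again assuming mbstring is loaded, with variable names chosen just for this example:

```php
<?php
$utf8 = "\xC3\xA9"; // e-acute, already encoded as UTF-8

// Treating those two bytes as Windows-1252 re-encodes each one separately.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n";  // c3a9
echo bin2hex($twice), "\n"; // c383c2a9 (A-tilde followed by a copyright sign)
```

That is why the conversion has to be guarded by a check for text that is already UTF-8.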
Here's a bunch of code that shows how I converted everything:
<?php
function convert_utf8($str){
    if(!seems_utf8($str))
        return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
    return $str;
}
function seems_utf8($Str) {
    for ($i=0; $i<strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}
$sql = 'SELECT [weblog entries and then comments]';
$rs_id = mysql_query($sql, $conn) or die("Couldn't get weblog entries: ".mysql_error());
$array = &slurpResultSet($rs_id);
foreach($array as $entry){
    $e = $entry['Comment_Text'];
    $utf8 = convert_utf8($e);
    if($e != $utf8){
        #run SQL to update the record with the UTF-8 text, etc.
    }
}
?>
I might as well give you slurpResultSet() while I'm here:
<?php
function & slurpResultset($rs_id, $field = NULL){
    $foo = array();
    while($row = mysql_fetch_assoc($rs_id)){
        $foo[] = $field ? $row[$field] : $row;
    }
    return $foo;
}
?>
I ran all that through all of my weblog entries and all of my comments and it seems to have worked great. It's neat to see question marks turn into real characters upon conversion.
As an aside, character encoding is one of the classic "leaky" things you have to deal with. When I changed my site's character encoding from ISO-8859-1 to UTF-8, all kinds of things came out as question marks, which was expected. What I didn't expect is that, because browsers now submit data to my site as UTF-8 (since that's what I serve), anywhere I pass that data on receives UTF-8 too. So, for instance, mail that my site sends me from my mail form or when I get a comment is UTF-8, and it turned out my mail client didn't handle it correctly. So, now I have to stick a Unicode charset header on e-mails I generate.
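A sketch of what such a header might look like. The helper name `utf8_mail_headers` and the example address are hypothetical, and the sketch assumes the sender address comes from trusted configuration rather than user input:

```php
<?php
// Hypothetical helper: builds the extra headers for a plain-text UTF-8 mail.
// $from is assumed to be trusted (no user-supplied line breaks).
function utf8_mail_headers($from) {
    return implode("\r\n", array(
        'From: ' . $from,
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    ));
}

// Usage (not run here). A non-ASCII subject needs MIME encoding as well:
// mail($to, mb_encode_mimeheader($subject, 'UTF-8'), $body,
//      utf8_mail_headers('blog@example.com'));
```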
So, now I have one of the probably very few sites on the Internet that serve XHTML 1.1 with Unicode encoding and appropriate content-types. There's a reason for that. It's hard to do. I noticed Mark Pilgrim reverted his site back to HTML 4.01. That's probably wise. What is neat, however, is that now when I re-implement my search-term highlighter I'll get to do it with a real XML parser rather than fudging it with regular expressions.
Oh yeah: Iñtërnâtiônàlizætiøn 
Also, this UTF-8 tool as well as FileFormat.Info were very useful in helping me figure out what the heck kind of characters were lurking in my content so that I could figure out what to do with them. Sam Ruby's guide to i18n was helpful too.
Update: Check out the Unicode support in action.
Update (Jun 4): I made a mistake in following the language declaration guidelines. I followed the rules for XHTML 1.0, not for XHTML 1.1. In XHTML 1.0, you should specify either 'xml:lang' or 'lang' on the <html> element depending on whether you serve the page as XML or HTML. In XHTML 1.1 you should always use only 'xml:lang'. Corrected on my site, but not in the source above.
Update (Dec 5): I figure I'll post my current XHTML header here. It's shorter and more correct than the one above (since it has that xml:lang correction and won't get stymied by q values).
<?php
if(!isset($_SERVER['HTTP_ACCEPT'])) $content_type = 'text/html';
else{ #maybe handle "q" values later
    foreach(array('application/xhtml+xml','text/xml') as $content_type)
        if(($pos = strpos($_SERVER['HTTP_ACCEPT'],$content_type)) !== false) break;
    if($pos === false) $content_type = 'text/html';
}
header("Content-type: $content_type; charset=UTF-8");
if($content_type != 'text/html'){?><?xml version="1.0" encoding="UTF-8"?><?php }?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="<?php echo $content_type?>; charset=UTF-8" />
<!-- etc. -->
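The header above still leaves "q" values for later. One way they might be handled is sketched below. The function `negotiate_content_type` is hypothetical, and it simplifies on purpose: wildcards such as */* are ignored, and any type the client does not list explicitly is treated as unacceptable.

```php
<?php
// Hypothetical helper: picks a content type from $candidates (listed in
// order of server preference) using the q values in an Accept header.
function negotiate_content_type($accept, $candidates, $default = 'text/html') {
    // Collect the q value the client gave for each media type it listed.
    $q_values = array();
    foreach (explode(',', $accept) as $range) {
        $params = array_map('trim', explode(';', $range));
        $type = strtolower(array_shift($params));
        $q = 1.0; // q defaults to 1 when omitted
        foreach ($params as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $q_values[$type] = $q;
    }
    // Pick the candidate with the highest q; earlier candidates win ties,
    // and a type sent with q=0 is never chosen.
    $best = $default;
    $best_q = 0.0;
    foreach ($candidates as $candidate) {
        if (isset($q_values[$candidate]) && $q_values[$candidate] > $best_q) {
            $best = $candidate;
            $best_q = $q_values[$candidate];
        }
    }
    return $best;
}

$candidates = array('application/xhtml+xml', 'text/xml', 'text/html');
echo negotiate_content_type('application/xhtml+xml;q=0, text/html', $candidates), "\n";
```

Listing text/html among the candidates matters: it lets a client that explicitly prefers HTML over XHTML get what it asked for, instead of always being upgraded.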