Keith Devens .com

← Web specifications have become too complex | Unicode testing →

Saturday, May 15, 2004

My site is now fully unicode-ized and xhtml-ized

I just finished everything I said I'd do a few days ago.

  • All of my old entries that had been written prior to me using StructuredText on my blog have been converted from HTML to StructuredText. Much of it I was able to automate, but parts of it I had to do manually. I converted over 1700 posts. I still can't be completely sure everything is correct. Also, there was the occasional HTML that couldn't be equivalently converted into StructuredText, so I had to leave those cases alone, escaping out to HTML in my markup. So, I at least converted those instances to XHTML, but those bits of XHTML may come back to bite me the next time web standards change.
  • All of my weblog entries (where applicable) have been converted from pre-unicode encodings (ISO-8859-1, but usually Windows-1252) to UTF-8.
  • Same goes for all of my weblog comments.
  • My pages are served as XHTML 1.1 with a content type of application/xhtml+xml for user-agents that support it, falling back to text/xml and then text/html for those that don't.
  • The appropriate language tags are output (xml:lang or just lang) based on what content-type is served.

Now for some code. First, here's how I change what content-type I send depending on the user-agent's accept header (reformatted slightly):

<?php
$accept = isset($_SERVER['HTTP_ACCEPT']) ?
    array_map('trim', explode(',', $_SERVER['HTTP_ACCEPT'])) :
    array();
$served_as_xml = true;
if(in_array('application/xhtml+xml', $accept)){
    $content_type = 'application/xhtml+xml';
}elseif(in_array('text/xml', $accept)){
    $content_type = 'text/xml';
}else{
    $content_type = 'text/html';
    $served_as_xml = false;
}
header("Content-type: $content_type; charset=utf-8");
?>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"<?php if(!$served_as_xml){?> lang="en"<?php }?>>
<head>
    <meta http-equiv="Content-Type" content="<?php echo $content_type;?>; charset=utf-8" />

...etc. Note, also, if you do this yourself, that XML requires that the XML prolog be at the very beginning of the document. If you have any blank lines or spaces in front of it, parsing will fail. Also, supposedly some browsers have problems if you include the XML prolog. I tested Mozilla, IE6, and Opera (7.5) and didn't notice any problems, but if you do please let me know.

That was the easy part. Unicode conversion was much more of a pain. PHP's built-in Unicode support is almost non-existent. It has a utf8_encode function, but that assumes ISO-8859-1 input and so doesn't handle Windows-1252 correctly (the curly quotes and dashes that Windows-1252 keeps in the 0x80-0x9F range come out as control characters), so I had to do something else. Thankfully, it turns out my host had PHP's mb (mbstring) extension installed, and I was able to use mb_convert_encoding to handle Windows-1252.

Also, the manual page for utf8_encode had this great little function called "seems_utf8" listed in one of the comments that was a life saver. Because some of my character data was inconsistent (some was in Unicode already), I couldn't just blindly convert from Windows-1252 to UTF-8 (also because to-utf8(to-utf8(text)) != to-utf8(text)). That function does a great job of figuring out whether character data you have is already in UTF-8, so I was able to safely convert to UTF-8.
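To make the double-conversion problem concrete, here is a minimal sketch. It is not part of the conversion script itself. It assumes the mbstring extension is loaded, and it swaps in mb_check_encoding as the "is this already UTF-8?" test in place of seems_utf8. The helper name to_utf8_once is made up for illustration.

```php
<?php
// Minimal sketch: why converting to UTF-8 twice corrupts text.
// Assumes the mbstring extension is available.

// "é" in Windows-1252 is the single byte 0xE9.
$cp1252 = "caf\xE9";

// First conversion: 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.
$once = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

// Second conversion treats each of those two bytes as a separate
// Windows-1252 character, so "é" turns into "Ã©".
$twice = mb_convert_encoding($once, 'UTF-8', 'Windows-1252');

// Hypothetical helper: convert only when the input is not already valid UTF-8.
// mb_check_encoding() stands in for seems_utf8() here.
function to_utf8_once($str) {
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
}

echo bin2hex($once), "\n";  // 636166c3a9
echo bin2hex($twice), "\n"; // 636166c383c2a9
```

Running the guarded helper over text that is already UTF-8 leaves it alone, which is the property the blind conversion lacks.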

Here's a bunch of code that shows how I converted everything:

<?php
function convert_utf8($str){
    if(!seems_utf8($str))
        return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
    return $str;
}
function seems_utf8($Str) {
 for ($i=0; $i<strlen($Str); $i++) {
  if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
   return false;
  }
 }
 return true;
}

$sql = 'SELECT [weblog entries and then comments]';
$rs_id = mysql_query($sql, $conn) or die("Couldn't get weblog entries: ".mysql_error());
$array = &slurpResultSet($rs_id);
foreach($array as $entry){
    $e = $entry['Comment_Text'];
    $utf8 = convert_utf8($e);
    if($e != $utf8){
        #run SQL to update the record with the UTF-8 text, etc.
    }
}
?>

I might as well give you slurpResultSet() while I'm here:

<?php
function & slurpResultset($rs_id, $field = NULL){
    $foo = array();
    while($row = mysql_fetch_assoc($rs_id)){
        $foo[] = $field ? $row[$field] : $row;
    }
    return $foo;
}
?>

I ran all that through all of my weblog entries and all of my comments and it seems to have worked great. It's neat to see question marks turn into real characters upon conversion.

As an aside, character coding is one of the classic "leaky" things you have to deal with. When I changed my site character encoding from ISO-8859-1 to UTF-8, all kinds of things came out as question marks, which was expected. But what I didn't expect is that, because data now comes into my site as Unicode from browsers because that's what I serve, anywhere I send that data also gets sent Unicode. So, for instance, mail that I get sent from my site through my mail form or when I get a comment is Unicode, and it turned out my mail client didn't handle it correctly. So, now I have to stick a Unicode charset header on e-mails I generate.
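For the e-mail part, here is a rough sketch of what "stick a Unicode charset header on" can look like. It isn't the code from my mail form; the function names and the example addresses are made up for illustration, and the subject encoding shown is the simple base64 form of RFC 2047, which only suits short subjects.

```php
<?php
// Sketch: headers for a plain-text UTF-8 e-mail.
// utf8_mail_headers() and utf8_mail_subject() are hypothetical helpers.
function utf8_mail_headers($from) {
    // $from must be a trusted value; user input here would allow header injection.
    return implode("\r\n", array(
        "From: $from",
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    ));
}

// Non-ASCII subjects have to be wrapped as an RFC 2047 encoded word.
// A single encoded word is limited to 75 characters, so long subjects
// would need to be split; that is left out of this sketch.
function utf8_mail_subject($subject) {
    return '=?UTF-8?B?' . base64_encode($subject) . '?=';
}

$headers = utf8_mail_headers('webmaster@example.com');
$subject = utf8_mail_subject('Iñtërnâtiônàlizætiøn');

// The actual send is commented out so the sketch has no side effects:
// mail('you@example.com', $subject, $body, $headers);
```

The Content-Type line is the piece that tells the mail client the body is UTF-8, so it doesn't have to guess.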

So, now I have one of the probably very few sites on the Internet that serve XHTML 1.1 with Unicode encoding and appropriate content-types. There's a reason for that. It's hard to do. I noticed Mark Pilgrim reverted his site back to HTML 4.01. That's probably wise. What is neat, however, is that now when I re-implement my search-term highlighter I'll get to do it with a real XML parser rather than fudging it with regular expressions.

Oh yeah: Iñtërnâtiônàlizætiøn :)

Also, this UTF-8 tool as well as FileFormat.Info were very useful in helping me figure out what the heck kind of characters were lurking in my content so that I could figure out what to do with them. Sam Ruby's guide to i18n was helpful too.

Update: Check out the Unicode support in action.

Update (Jun 4): I made a mistake in following the language declaration guidelines. I followed the rules for XHTML 1.0, not for XHTML 1.1. In XHTML 1.0, you should specify either 'xml:lang' or 'lang' on the <html> element depending on whether you serve the page as XML or HTML. In XHTML 1.1 you should always use only 'xml:lang'. Corrected on my site, but not in the source above.

Update (Dec 5): I figure I'll post my current XHTML header here. It's shorter and more correct than the one above (since it has that xml:lang correction and won't get stymied by q values).

<?php
if(!isset($_SERVER['HTTP_ACCEPT'])) $content_type = 'text/html';
else{ #maybe handle "q" values later
    foreach(array('application/xhtml+xml','text/xml') as $content_type)
        if(($pos = strpos($_SERVER['HTTP_ACCEPT'],$content_type)) !== false) break;
    if($pos === false) $content_type = 'text/html';
}
header("Content-type: $content_type; charset=UTF-8");
if($content_type != 'text/html'){?><?xml version="1.0" encoding="UTF-8"?><?php }?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
    <meta http-equiv="Content-Type" content="<?php echo $content_type?>; charset=UTF-8" />
<!-- etc. -->
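
For anyone who does want to handle "q" values, here is one way it could be sketched. This is not what runs on my site: pick_content_type() is a hypothetical helper, it ignores wildcards such as */*, and it only looks at the q parameter.

```php
<?php
// Sketch: choose a content type from an Accept header, honoring q values.
// $candidates is listed in the server's order of preference; ties in q
// go to the earlier candidate. Wildcard ranges (*/*) are ignored here.
function pick_content_type($accept, $candidates, $default = 'text/html') {
    $q = array();
    foreach (explode(',', $accept) as $range) {
        $parts = array_map('trim', explode(';', $range));
        $type = strtolower(array_shift($parts));
        $quality = 1.0; // q defaults to 1 when absent
        foreach ($parts as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
                $quality = (float)$m[1];
            }
        }
        $q[$type] = $quality;
    }
    $best = $default;
    $best_q = 0.0;
    foreach ($candidates as $type) {
        // A type with q=0 is explicitly refused, so it never beats $best_q.
        if (isset($q[$type]) && $q[$type] > $best_q) {
            $best = $type;
            $best_q = $q[$type];
        }
    }
    return $best;
}

$types = array('application/xhtml+xml', 'text/xml', 'text/html');
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
$content_type = pick_content_type($accept, $types);
```

Including text/html in the candidate list means a browser that ranks text/html above application/xhtml+xml gets what it asked for, while the default still covers a missing or unhelpful header.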

Comments

andrin wrote:

Hello Keith!

This was very interesting reading and I have given this serious thought and I will now do the same on my site. I only wonder if there is any specific reason to explode the HTTP_ACCEPT string and then check for values with in_array instead of using strpos? My HTTP_ACCEPT looks like this: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,[wrapped -- ed]
text/plain;q=0.8,video/x-mng,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1
and as you see there are semicolons as well, so exploding this and picking the first four would give me:
[0] => text/xml
[1] => application/xml
[2] => application/xhtml+xml
[3] => text/html;q=0.9
Imagine if the UA had switched the places of text/html and application/xhtml+xml; then the first four would yield:
[0] => text/xml
[1] => application/xml
[2] => text/html
[3] => application/xhtml+xml;q=0.9
Wouldn't in_array fail to find the application/xhtml+xml here? It would of course fall back on the bulletproof text/html, but I would just like to give you a heads-up on this issue, if it is an issue ;-)
As I asked earlier: if there is any good reason to explode it and do the search in the array, please let me know, because I think it's a brilliant solution.
By the way, I just finished reading some WDs from the W3C's site and they have a really good document; if you haven't read it yet, give it (http://www.w3.org/TR/i18n-html-tech-lang/) a glance.

Cheers

∴ andrin | 22-May-2004 1:38pm est | #4649

Keith (http://keithdevens.com/) wrote:

Andrin, that's a good point. I was aware of the issue but chose to ignore it because I don't believe it will come up often in practice (since if anything, text/html will be the content-type listed with reduced quality, not application/xhtml+xml), and because I default to text/html anyway which will work fine as a fallback.

The main reason I did it the way I did was just personal preference. I may choose to improve the parser to take care of the issue you highlighted.

By the way, that i18n document from the W3C you linked to I already linked to on the post I referenced from this post. Thanks though.

Keith | 23-May-2004 4:49am est | http://keithdevens.com/ | #4653

andrin wrote:

Yeah, I saw that you linked to the document I mentioned later when I was scavenging the rest of your site ;) There sure is a lot to read. Kept me up all night :)

∴ andrin | 23-May-2004 5:51pm est | #4664

Jim wrote:

You can pass "auto" as the encoding to mb_convert_encoding(), which means you didn't have to use the UTF-8 detector.

∴ Jim | 20-Jul-2004 1:55pm est | #5046

Keith (http://keithdevens.com/) wrote:

I think I remember trying that and it not doing what I wanted. Thanks for the note though. If I ever need to do something like this again I'll make sure to try it.

Keith | 20-Jul-2004 5:22pm est | http://keithdevens.com/ | #5048
