KBD

Keith Devens .com

Saturday, March 20, 2010 Flag waving
And if you go too far up, abstraction-wise, you run out of oxygen. Sometimes smart thinkers just don't... – Joel Spolsky

Tag: Unicode

Monday, December 4, 2006

  1. Unicode character information (via Simon). Also see fileformat.info

       (0) Tags: [Unicode]

Thursday, May 18, 2006

  1. Symmetric Difference -- From MathWorld. A ⊖ B ⇔ (A-B) ∪ (B-A). I wasn't sure if I was using the phrase "symmetric difference" correctly, so I checked and I was. I added the circled minus to my cut and paste list of fancy characters.

       (0) Tags: [Mathematics, Unicode]

Friday, November 4, 2005

Sam Ruby: Sometimes the dragon wins

Sam Ruby: Sometimes the dragon wins. Not here: ɥɦɐ. And you can search for it.

My dirty little secret, however, is that I'm storing everything in MySQL in fields declared to be encoded in latin1, storing UTF-8 in there anyway, and trusting the browser to get Unicode right. I've tried storing everything in Unicode-encoded fields in MySQL, but it mangled some of the rarer characters when I tried to convert this post once my host upgraded to MySQL 4.0 or 4.1 (whichever one first supported Unicode fields). Some characters worked, some didn't. My only guess is that MySQL only handled an older version of Unicode and stripped everything it didn't understand.

Friday, November 19, 2004

Cut and paste list for fancy characters

Whenever I want a special character I always have to go to the Windows character map, or find some Unicode reference online. FileFormat.Info is a fantastic resource when it's up... but right now it's not. I've been meaning to do this for a while: I'm starting a list of all non-ASCII characters I ever use so that I can cut and paste them from here.

Accents: ÀàÁáÂâÃãÄäÅ寿ŒœÇçÈèÉéÊêËëÌìÍíÎîÏïÑñÒòÓóÔôÕõÖöÙùÚúÛûÜüŸÿÝýŠšÞþÐß

Quotes/brackets/punctuation: ‘(lq)’(rq)“(lqq)”(rqq)❛❜❝❞‹›«»〈〉…•(bullet)†‡§¶–(en)—(em)‰

Math/logic:
∀∃□◇¬→↔∨∧⇒⇔∴±×÷⋅(dot)⊥∅⊖⊂⊆⊄∉∈∋∌⊅⊇⊃∝∞∪∩∼≈≡≠≤≥≪≫⌈⌉⌊⌋
⅛¼⅓⅜½⅝⅔¾⅞°(deg)⁰¹²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉ℤℝℜµ∫√∏∑∂№‰

Legal/currency: ™©®¥£€¤¢

Foreign language: ¡¿ΩωλΘ(upper)θ(lower)ΦπΔαβγ

Other:
☺☻♀♂℞☤☠☢★☆☯☮☧✓✔❤♪♫♭♮♯
♔♕♖♗♘♙♚♛♜♝♞♟♡♢♧♤♥♦♣♠
↲←↑↓

This resource was useful as well, and here's a list of HTML entities. Here's FileFormat.info's dump of Unicode's 'Mathematical Operators' block.

Saturday, May 15, 2004

My site is now fully unicode-ized and xhtml-ized

I just finished everything I said I'd do a few days ago.

  • All of my old entries that had been written prior to me using StructuredText on my blog have been converted from HTML to StructuredText. Much of it I was able to automate, but parts of it I had to do manually. I converted over 1700 posts. I still can't be completely sure everything is correct. Also, there was the occasional HTML that couldn't be equivalently converted into StructuredText, so I had to leave those cases alone, escaping out to HTML in my markup. So, I at least converted those instances to XHTML, but those bits of XHTML may come back to bite me the next time web standards change.
  • All of my weblog entries (where applicable) have been converted from pre-Unicode encodings (ISO-8859-1, but usually Windows-1252) to UTF-8.
  • Same goes for all of my weblog comments.
  • My pages are served as XHTML 1.1 with a content type of application/xhtml+xml for user-agents that support it, falling back to text/xml and then text/html for those that don't.
  • The appropriate language tags are output (xml:lang or just lang) based on what content-type is served.

Now for some code. First, here's how I change what content-type I send depending on the user-agent's accept header (reformatted slightly):

<?php
// Split the Accept header into a list of media types.
// (Entries carrying parameters such as ";q=0.9" will not match below.)
$accept = isset($_SERVER['HTTP_ACCEPT']) ?
    array_map('trim', explode(',', $_SERVER['HTTP_ACCEPT'])) :
    array();
$served_as_xml = true;
if (in_array('application/xhtml+xml', $accept)) {
    $content_type = 'application/xhtml+xml';
} elseif (in_array('text/xml', $accept)) {
    $content_type = 'text/xml';
} else {
    $content_type = 'text/html';
    $served_as_xml = false;
}
header("Content-type: $content_type; charset=utf-8");
?>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"<?php if(!$served_as_xml){?> lang="en"<?php }?>>
<head>
    <meta http-equiv="Content-Type" content="<?php echo $content_type;?>; charset=utf-8" />

...etc. Note, also, if you do this yourself, that XML requires the XML declaration (the <?xml ... ?> line) to be at the very beginning of the document. If any blank lines or spaces precede it, parsing will fail. Also, supposedly some browsers have problems if you include the XML declaration. I tested Mozilla, IE6, and Opera (7.5) and didn't notice any problems, but if you do, please let me know.

That was the easy part. Unicode conversion was much more of a pain. PHP's built-in Unicode support is almost non-existent. There is a utf8_encode function, but it only converts from ISO-8859-1, so it doesn't handle the extra characters Windows-1252 defines in the 0x80-0x9F range, and I had to do something else. Thankfully, it turns out my host had PHP's mbstring extension installed, and I was able to use mb_convert_encoding to handle Windows-1252.
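To make the ISO-8859-1 versus Windows-1252 difference concrete, here is a minimal sketch that converts the same byte under both assumptions. It assumes the mbstring extension is loaded. The byte 0x93 is chosen purely for illustration: it is a left curly quote in Windows-1252 but an invisible control code in ISO-8859-1, which is the mapping utf8_encode applies.

```php
<?php
// 0x93 is a left curly quote in Windows-1252, a C1 control code in ISO-8859-1.
$byte = "\x93";

// Same input byte, two different source-encoding assumptions.
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($as_latin1), "\n"; // c293   (U+0093, a control character)
echo bin2hex($as_cp1252), "\n"; // e2809c (U+201C, left double quotation mark)
```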

Also, the manual page for utf8_encode had this great little function called "seems_utf8" listed in one of the comments that was a life saver. Because some of my character data was inconsistent (some was in Unicode already), I couldn't just blindly convert from Windows-1252 to UTF-8 (also because to-utf8(to-utf8(text)) != to-utf8(text)). That function does a great job of figuring out whether character data you have is already in UTF-8, so I was able to safely convert to UTF-8.
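The reason a second conversion does damage can be seen on a single character. This is a small sketch, again assuming mbstring is available; the sample byte is only an example.

```php
<?php
// "é" stored as one Windows-1252 byte.
$cp1252 = "\xE9";

// First conversion: correct UTF-8.
$once = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

// Second conversion wrongly treats the two UTF-8 bytes as two
// Windows-1252 characters ("Ã" and "©") and encodes each of them again.
$twice = mb_convert_encoding($once, 'UTF-8', 'Windows-1252');

echo bin2hex($once), "\n";  // c3a9
echo bin2hex($twice), "\n"; // c383c2a9
```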

Here's a bunch of code that shows how I converted everything:

<?php
// Convert to UTF-8 only if the string does not already look like UTF-8.
function convert_utf8($str){
    if (!seems_utf8($str))
        return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
    return $str;
}

// Byte-pattern check from the utf8_encode manual comments. It still accepts
// the old 5- and 6-byte forms, which current UTF-8 (RFC 3629) no longer allows.
function seems_utf8($Str) {
 for ($i=0; $i<strlen($Str); $i++) {
  if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
    return false;
  }
 }
 return true;
}

$sql = 'SELECT [weblog entries and then comments]';
$rs_id = mysql_query($sql, $conn) or die("Couldn't get weblog entries: ".mysql_error());
$array = &slurpResultSet($rs_id);
foreach ($array as $entry) {
    $e = $entry['Comment_Text'];
    $utf8 = convert_utf8($e);
    if ($e != $utf8) {
        #run SQL to update the record with the UTF-8 text, etc.
    }
}
?>
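For a quick check of the "convert only if needed" idea without a database, here is a self-contained sketch. The helper name to_utf8_once is hypothetical, and mbstring's mb_check_encoding is swapped in for seems_utf8 as the validity test, so this is not the exact code I ran.

```php
<?php
// Hypothetical helper: mb_check_encoding() stands in for seems_utf8().
function to_utf8_once(string $s): string {
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

$legacy  = "caf\xE9";     // "café" in Windows-1252
$already = "caf\xC3\xA9"; // "café" in UTF-8

echo bin2hex(to_utf8_once($legacy)), "\n";  // 636166c3a9
echo bin2hex(to_utf8_once($already)), "\n"; // 636166c3a9
```

Both inputs end up as the same bytes, and running the helper again changes nothing. The check is a heuristic: a Windows-1252 string whose bytes happen to form valid UTF-8 would be left alone.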

I might as well give you slurpResultSet() while I'm here:

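<?php
// Read every row of a result set into an array. If $field is given,
// collect just that column instead of the whole row.
function &slurpResultSet($rs_id, $field = NULL){
    $foo = array();
    while ($row = mysql_fetch_assoc($rs_id)) {
        $foo[] = $field ? $row[$field] : $row;
    }
    return $foo;
}
?>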

I ran all that through all of my weblog entries and all of my comments and it seems to have worked great. It's neat to see question marks turn into real characters upon conversion.

As an aside, character encoding is one of the classic "leaky" things you have to deal with. When I changed my site's character encoding from ISO-8859-1 to UTF-8, all kinds of things came out as question marks, which was expected. What I didn't expect is that, because browsers now submit data to my site as UTF-8 (since that's what I serve), anywhere I send that data also receives UTF-8. So, for instance, mail sent from my site through my mail form, or when I get a comment, is UTF-8, and it turned out my mail client didn't handle it correctly. So now I have to put a UTF-8 charset header on the e-mails I generate.
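Roughly, the charset header for generated mail looks like the sketch below. The helper name utf8_mail_headers and the example address are hypothetical, and the actual mail() call is left commented out so the snippet has no side effects.

```php
<?php
// Hypothetical helper: extra headers declaring a UTF-8 plain-text body.
function utf8_mail_headers(string $from): string {
    return implode("\r\n", array(
        "From: $from",
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    ));
}

$headers = utf8_mail_headers('site@example.com');
echo $headers, "\n";

// mail($to, $subject, $body, $headers); // $to, $subject, $body supplied by the form
```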

So, now I have one of the probably very few sites on the Internet that serve XHTML 1.1 with Unicode encoding and appropriate content-types. There's a reason for that. It's hard to do. I noticed Mark Pilgrim reverted his site back to HTML 4.01. That's probably wise. What is neat, however, is that now when I re-implement my search-term highlighter I'll get to do it with a real XML parser rather than fudging it with regular expressions.

Oh yeah: Iñtërnâtiônàlizætiøn :)

Also, this UTF-8 tool as well as FileFormat.Info were very useful in helping me figure out what the heck kind of characters were lurking in my content so that I could figure out what to do with them. Sam Ruby's guide to i18n was helpful too.

Update: Check out the Unicode support in action.

Update (Jun 4): I made a mistake in following the language declaration guidelines. I followed the rules for XHTML 1.0, not for XHTML 1.1. In XHTML 1.0, you should specify either 'xml:lang' or 'lang' on the <html> element depending on whether you serve the page as XML or HTML. In XHTML 1.1 you should always use only 'xml:lang'. Corrected on my site, but not in the source above.

Update (Dec 5): I figure I'll post my current XHTML header here. It's shorter and more correct than the one above (since it has that xml:lang correction and won't get stymied by q values).

<?php
if (!isset($_SERVER['HTTP_ACCEPT'])) $content_type = 'text/html';
else { #maybe handle "q" values later
    // Take the first preferred type that appears anywhere in the Accept header.
    foreach (array('application/xhtml+xml','text/xml') as $content_type)
        if (($pos = strpos($_SERVER['HTTP_ACCEPT'], $content_type)) !== false) break;
    if ($pos === false) $content_type = 'text/html';
}
header("Content-type: $content_type; charset=UTF-8");
if ($content_type != 'text/html'){?><?xml version="1.0" encoding="UTF-8"?><?php }?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
    <meta http-equiv="Content-Type" content="<?php echo $content_type?>; charset=UTF-8" />
<!-- etc. -->
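If "q" values ever do get handled, one possible shape is sketched below. The function names parse_accept and pick_content_type are hypothetical, wildcards such as */* are ignored, and ties go to the earlier entry in the preference list, so treat it as an outline and not a complete implementation of HTTP content negotiation.

```php
<?php
// Hypothetical helper: map each media type in an Accept header to its q-value.
function parse_accept(string $accept): array {
    $q_by_type = array();
    foreach (explode(',', $accept) as $item) {
        $parts = array_map('trim', explode(';', $item));
        $type = strtolower(array_shift($parts));
        if ($type === '') continue;
        $q = 1.0; // q defaults to 1 when absent
        foreach ($parts as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $q_by_type[$type] = $q;
    }
    return $q_by_type;
}

// Hypothetical helper: pick the listed type with the highest q-value,
// preferring earlier entries on ties and falling back to text/html.
function pick_content_type(?string $accept): string {
    $best = 'text/html';
    $best_q = 0.0;
    if ($accept === null) return $best;
    $q_by_type = parse_accept($accept);
    foreach (array('application/xhtml+xml', 'text/xml', 'text/html') as $type) {
        $q = isset($q_by_type[$type]) ? $q_by_type[$type] : 0.0;
        if ($q > $best_q) {
            $best = $type;
            $best_q = $q;
        }
    }
    return $best;
}

echo pick_content_type('application/xhtml+xml,text/html;q=0.9'), "\n"; // application/xhtml+xml
echo pick_content_type('text/html,application/xhtml+xml;q=0.5'), "\n"; // text/html
```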

Friday, October 10, 2003

Joel Spolsky on Unicode and Character Sets

Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
