Keith Devens .com
Saturday, July 4, 2009

Aidan Kehoe (http://www.parhasard.net/) wrote:
Micah (http://wubi.org/) wrote:
More regexps for Asian character-set encodings in this paper:
Rodent of Unusual Size (http://Ken.Coar.Org/burrow/) wrote:
actually, after looking at rfc3629 and talking to sam ruby, it appears that this regex is, in fact, not valid. the rfc specifies that octets 0x00-0x7F are valid, being identical to their ASCII counterparts, but the regex excludes much of that range. it looks as though the utf-8 definition was conflated with the xml 1.0 character list to the detriment of the stated purpose ('is this valid utf-8?' determination).
Keith (http://keithdevens.com/) wrote:
Thanks for pointing that out, Ken! I'd wondered about that myself but didn't think too much about it. You're probably right that they conflated the XML 1.0 character definition with the definition of UTF-8.
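To make the point about 0x00-0x7F concrete, here is a minimal sketch of a check that follows only the UTF8-octets grammar in RFC 3629, section 4, so every ASCII octet (control characters included) passes. It is not the regex from the post; the function name `is_rfc3629_utf8` is hypothetical and chosen for illustration.

```php
<?php
// Hypothetical helper: true when $bytes matches the RFC 3629 UTF8-octets grammar.
function is_rfc3629_utf8(string $bytes): bool
{
    // Byte-oriented pattern (no /u flag). The possessive *+ avoids
    // backtracking, since the branches differ in their lead byte.
    $pattern = '/\A(?:
          [\x00-\x7F]                        # UTF8-1: all of ASCII
        | [\xC2-\xDF][\x80-\xBF]             # UTF8-2: no overlong 2-byte forms
        | \xE0[\xA0-\xBF][\x80-\xBF]         # UTF8-3: no overlong 3-byte forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # UTF8-3: straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # UTF8-3: no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # UTF8-4: planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # UTF8-4: planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # UTF8-4: plane 16
    )*+\z/x';

    return preg_match($pattern, $bytes) === 1;
}

var_dump(is_rfc3629_utf8("\x00\x01\x7F"));  // control characters are valid UTF-8
var_dump(is_rfc3629_utf8("\xC0\x80"));      // overlong encoding of NUL is not
```

Anything stricter, such as rejecting control characters that XML 1.0 disallows, would be a separate check layered on top of this one.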
Andy wrote:
Quick Q if someone can tell me. Is  valid utf-8 or must it be   ? Trying to insert XML (unencoded) into a DB that was reportedly set for UTF-8. It is failing and I am trying to understand if I need to encode or the DBA should fix the DB.
Keith (http://keithdevens.com/) wrote:
Run the above regular expression against your XML to see if the  is actually encoded in UTF-8 or whether it's Latin-1 or something. If the db is supposed to accept UTF-8 and your XML passes the regex, then the db is wrong. More likely, your XML isn't actually encoded in UTF-8.
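As a rough illustration of that check, the sketch below compares the same text stored as Latin-1 bytes and as UTF-8 bytes. The character in the original question did not survive, so "é" is used here purely as an assumed example, and `looks_like_utf8` is a hypothetical helper that relies on PCRE's `/u` modifier rather than on the regex from the post.

```php
<?php
// Hypothetical helper: true when $bytes is well-formed UTF-8.
// With the /u modifier PCRE validates the subject first; on invalid
// UTF-8 preg_match() returns false instead of 1.
function looks_like_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "caf\xE9";      // "café" with é as the single Latin-1 byte 0xE9
$utf8   = "caf\xC3\xA9";  // "café" with é as the UTF-8 pair 0xC3 0xA9

var_dump(looks_like_utf8($latin1)); // bool(false)
var_dump(looks_like_utf8($utf8));   // bool(true)
```

If the XML fails a check like this, the data most likely needs re-encoding before it reaches the database; if it passes and the insert still fails, the database configuration is the more likely culprit.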
Al wrote:
"Conflated" -- cool, learned a new word.
Actually came for the regex (very useful, thanks!)
if (@iconv($input, 'UTF-8', 'UTF-8') == $input) echo "Good UTF-8!";
else echo "Nope.";
is much, much faster, if iconv() is available.