Keith Devens .com
Thursday, May 15, 2008

Aidan Kehoe (http://www.parhasard.net/) wrote:
Micah (http://wubi.org/) wrote:
More regexps for Asian character-set encodings in this paper:
Rodent of Unusual Size (http://Ken.Coar.Org/burrow/) wrote:
actually, after looking at rfc3629 and talking to sam ruby, it appears that this regex is, in fact, not valid. the rfc specifies that octets 0x00-0x7F are valid, being identical to their ASCII counterparts, but the regex excludes much of that range. it looks as though the utf-8 definition was conflated with the xml 1.0 character list to the detriment of the stated purpose ('is this valid utf-8?' determination).
Keith (http://keithdevens.com/) wrote:
Thanks for pointing that out, Ken! I'd wondered about that myself but didn't think too much about it. You're probably right that they conflated the XML 1.0 character definition with the definition of UTF-8.
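For reference, here is a rough sketch of what a check that follows only the RFC 3629 grammar might look like. It accepts the whole 0x00-0x7F range, control characters included, and applies no XML 1.0 restrictions. The function name is_valid_utf8_rfc3629 is made up for this example, it is not the regex from the original post, and it assumes PHP 7 or later with the standard PCRE extension.

```php
<?php
// Hypothetical helper (name invented for illustration).
// Byte ranges follow the UTF8-1 .. UTF8-4 productions in RFC 3629, section 4.
function is_valid_utf8_rfc3629(string $bytes): bool
{
    // No "u" modifier: the pattern is matched byte by byte.
    $pattern = '/\A(?:
          [\x00-\x7F]                        # UTF8-1: all of ASCII, control characters included
        | [\xC2-\xDF][\x80-\xBF]             # UTF8-2: C0 and C1 would be overlong
        | \xE0[\xA0-\xBF][\x80-\xBF]         # UTF8-3: excludes overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # UTF8-3: ordinary three-byte forms
        | \xED[\x80-\x9F][\x80-\xBF]         # UTF8-3: excludes UTF-16 surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # UTF8-4: excludes overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}          # UTF8-4: ordinary four-byte forms
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # UTF8-4: nothing above U+10FFFF
    )*+\z/x';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $bytes) === 1;
}
```

Something like is_valid_utf8_rfc3629("\x01abc") would then count as valid UTF-8, even though a check written against the XML 1.0 character list would reject the control character. Very long inputs may run into PCRE limits, in which case this sketch reports false.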
if (@iconv('UTF-8', 'UTF-8', $input) === $input) echo "Good UTF-8!"; else echo "Nope.";
is much, much faster, if iconv() is available.
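Since that only works "if iconv() is available", one way to wrap it is sketched below. The name utf8_check_iconv and the choice to return null when the extension is missing are assumptions made for this example, not something from the comment above. It assumes PHP 7.1 or later. What iconv() rejects also depends on the underlying iconv library, so treat the result as a quick check rather than a strict RFC 3629 verdict.

```php
<?php
// Hypothetical wrapper (name and null convention invented for illustration).
// Returns true or false when iconv() is usable, null when it is not.
function utf8_check_iconv(string $input): ?bool
{
    if (!function_exists('iconv')) {
        return null; // caller has to fall back to some other check
    }

    // PHP argument order is iconv(from_encoding, to_encoding, string).
    // On bad input iconv() returns false (and raises a notice, hence the @),
    // so the strict comparison with the original string fails.
    $roundTripped = @iconv('UTF-8', 'UTF-8', $input);

    return $roundTripped === $input;
}
```

A caller could then use the regex route only when utf8_check_iconv($input) comes back as null.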