KBD

Keith Devens .com

← I may decide to check out Wax soon | Democrats want to take your money →

Tuesday, June 29, 2004

A regular expression to check for valid UTF-8

This is great. A regular expression that allows you to check if text is valid UTF-8. Via Sam Ruby.

I'd previously used a function I found in the PHP manual and reproduced here. I like the regular expression better for aesthetic reasons, because it appears more straightforward (and is therefore easier to verify), and because it's easier to port to any language you need. It may even be faster than the other code. I'll reproduce it here in case that page ever goes away:

$field =~
  m/^(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$/x;

Excellent.

Update: Making that a non-capturing group would probably be better.
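As a rough illustration of both the non-capturing group and the portability point, here is how the same pattern might look in Python. This is only a sketch: the names VALID_UTF8 and is_valid_utf8 are made up for the example, the pattern is applied to raw bytes, and the first branch is read as the hex escapes \x09, \x0A and \x0D (tab, line feed, carriage return).

```python
import re

# Sketch of a Python port (hypothetical names, not from the original page).
# (?: ... ) is a non-capturing group, so the engine does not record a
# submatch for every character it walks over.
VALID_UTF8 = re.compile(
    rb"""\A(?:
        [\x09\x0A\x0D\x20-\x7E]              # ASCII (tab, LF, CR, printable)
      | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches one branch above."""
    return VALID_UTF8.match(data) is not None
```

The pattern has to be run against the undecoded bytes; once the text has been decoded into a Python str, the check has effectively already happened.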


Comments

Aidan Kehoe (http://www.parhasard.net/) wrote:

if (@iconv('UTF-8', 'UTF-8', $input) === $input) echo "Good UTF-8!";
else echo "Nope.";
is much, much faster, if iconv() is available.
∴ Aidan Kehoe | 30-Jun-2004 8:02am est | http://www.parhasard.net/ | #4893
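For comparison, the same "let the decoder decide" idea from the comment above can be sketched in Python. This is not Aidan's code, just an analogue using the standard library's strict UTF-8 decoder; the function name is hypothetical.

```python
def is_valid_utf8_by_decoding(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors.

    Hypothetical helper for illustration: it leans on Python's built-in
    decoder, whose default error handling is "strict".
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Like the iconv() approach, this checks plain UTF-8 validity only, so it accepts ASCII control characters that the regular expression in the post rejects.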

Micah (http://wubi.org/) wrote:

More regexps for Asian character-set encodings in this paper:

http://examples.oreilly.com/cjkvinfo/perl/svpm99-paper.pdf

∴ Micah | 30-Jun-2004 10:34am est | http://wubi.org/ | #4894

Rodent of Unusual Size (http://Ken.Coar.Org/burrow/) wrote:

actually, after looking at rfc3629 and talking to sam ruby, it appears that this regex is, in fact, not valid. the rfc specifies that octets 0x00-0x7F are valid, being identical to their ASCII counterparts, but the regex excludes much of that range. it looks as though the utf-8 definition was conflated with the xml 1.0 character list to the detriment of the stated purpose ('is this valid utf-8?' determination).

∴ Rodent of Unusual Size | 19-Aug-2004 10:59am est | http://Ken.Coar.Org/burrow/ | #5302
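The point in the comment above can be checked with a small experiment. The sketch below is a Python port, not code from the post or from the RFC; build_pattern, AS_POSTED and FULL_ASCII are hypothetical names. It builds the pattern twice, once with the ASCII branch as posted (read as \x09, \x0A, \x0D plus \x20-\x7E) and once with the full 0x00-0x7F range the comment describes.

```python
import re

# Multi-byte branches, shared by both variants (verbose-mode fragment).
MULTIBYTE = rb"""
  | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
"""


def build_pattern(ascii_class: bytes) -> "re.Pattern":
    """Compile the validator with the given single-byte character class."""
    return re.compile(rb"\A(?:" + ascii_class + MULTIBYTE + rb")*\Z", re.VERBOSE)


# ASCII branch as in the post: tab, LF, CR and printable characters only.
AS_POSTED = build_pattern(rb"[\x09\x0A\x0D\x20-\x7E]")
# Assumed variant: every octet from 0x00 to 0x7F is allowed.
FULL_ASCII = build_pattern(rb"[\x00-\x7F]")

sample = b"a\x00b"  # contains a NUL byte, which is a one-byte UTF-8 sequence
print(AS_POSTED.match(sample) is not None)   # False
print(FULL_ASCII.match(sample) is not None)  # True
```

The two variants differ only on control characters such as NUL and DEL; the multi-byte branches are identical, so both still reject overlong forms and surrogates.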

Keith (http://keithdevens.com/) wrote:

Thanks for pointing that out, Ken! I'd wondered about that myself but didn't think too much about it. You're probably right that they conflated the XML 1.0 character definition with the definition of UTF-8.

Keith | 19-Aug-2004 11:20am est | http://keithdevens.com/ | #5303
