Keith Devens .com

Tuesday, June 29, 2004

A regular expression to check for valid UTF-8

This is great. A regular expression that allows you to check if text is valid UTF-8. Via Sam Ruby.

I'd previously used a function I found in the PHP manual and reproduced here. I like the regular expression better for aesthetic reasons, because it appears more straightforward (and is therefore easier to verify for correctness), and because it's easier to port to any language you need. It's possibly even faster than the other code. I'll reproduce the regular expression here in case that page ever goes away:

$field =~
  m/^(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$/x;

Excellent.

Update: Making that a non-capturing group would probably be better.
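
As a rough illustration of both the porting claim and the non-capturing-group update, here is a sketch of the same pattern in Python. The names `UTF8_TEXT` and `looks_like_utf8` are made up for this example. It assumes the input is a `bytes` object, and it uses `fullmatch` in place of the `^...$` anchors.

```python
import re

# Byte-level port of the pattern above, using a non-capturing group (?:...).
# Like the original, the ASCII branch admits only tab, LF, CR and 0x20-0x7E.
UTF8_TEXT = re.compile(
    rb"""
    (?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
    """,
    re.VERBOSE,
)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte of data is consumed by the pattern."""
    return UTF8_TEXT.fullmatch(data) is not None
```

Calling `looks_like_utf8("héllo".encode("utf-8"))` should return True, while the same string encoded as Latin-1 should fail, because the lone 0xE9 byte is not followed by two continuation bytes.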

Comments

Aidan Kehoe (http://www.parhasard.net/) wrote:

if (@iconv('UTF-8', 'UTF-8', $input) === $input) echo "Good UTF-8!";
else echo "Nope.";
is much, much faster, if iconv() is available.
∴ Aidan Kehoe | 30-Jun-2004 8:02am est | http://www.parhasard.net/ | #4893
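
For comparison, the same "let the converter decide" idea can be sketched in Python by asking the built-in strict UTF-8 decoder to decode the bytes and treating a failure as invalid input. The name `is_valid_utf8` is made up here, and this is Python's decoder standing in for iconv, not a port of the PHP above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly with Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Unlike the regular expression in the post, this check accepts every byte from 0x00 to 0x7F, including control characters.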

Micah (http://wubi.org/) wrote:

More regexps for Asian character-set encodings in this paper:

http://examples.oreilly.com/cjkvinfo/perl/svpm99-paper.pdf

∴ Micah | 30-Jun-2004 10:34am est | http://wubi.org/ | #4894

Rodent of Unusual Size (http://Ken.Coar.Org/burrow/) wrote:

actually, after looking at rfc3629 and talking to sam ruby, it appears that this regex is, in fact, not valid. the rfc specifies that octets 0x00-0x7F are valid, being identical to their ASCII counterparts, but the regex excludes much of that range. it looks as though the utf-8 definition was conflated with the xml 1.0 character list to the detriment of the stated purpose ('is this valid utf-8?' determination).

∴ Rodent of Unusual Size | 19-Aug-2004 10:59am est | http://Ken.Coar.Org/burrow/ | #5302

Keith (http://keithdevens.com/) wrote:

Thanks for pointing that out Ken! I'd wondered about that myself but didn't think too much about it. You're probably right that they conflated the XML 1.0 character definition with the definition of UTF-8.

Keith | 19-Aug-2004 11:20am est | http://keithdevens.com/ | #5303
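
Ken's point can be sketched by widening only the first branch of the pattern to 0x00-0x7F, the full single-octet range that RFC 3629 allows. This is a Python sketch under that one assumption. The names `RFC3629_UTF8` and `is_rfc3629_utf8` are made up for the example, and the multi-byte branches are copied unchanged from the pattern in the post.

```python
import re

# Same multi-byte branches as the pattern in the post; only the ASCII branch
# is widened from "tab, LF, CR, 0x20-0x7E" to the whole 0x00-0x7F range.
RFC3629_UTF8 = re.compile(
    rb"""
    (?:
        [\x00-\x7F]                         # all of ASCII
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
    """,
    re.VERBOSE,
)


def is_rfc3629_utf8(data: bytes) -> bool:
    """Return True if data consists only of well-formed UTF-8 sequences."""
    return RFC3629_UTF8.fullmatch(data) is not None
```

With this variant, control characters such as NUL and ESC pass, while overlong forms, surrogates, and stray continuation bytes are still rejected.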

Andy wrote:

Quick Q if someone can tell me. Is  valid utf-8 or must it be   ? Trying to insert XML (unencoded) into a DB that was reportedly set for UTF-8. It is failing and I am trying to understand if I need to encode or the DBA should fix the DB.

∴ Andy | 20-May-2008 4:36pm est | #10680

Keith (http://keithdevens.com/) wrote:

Run the above regular expression against your XML to see if the  is actually encoded in UTF-8 or whether it's Latin-1 or something. If the db is supposed to accept UTF-8 and your XML passes the regex, then the db is wrong. More likely, your XML isn't actually encoded in UTF-8.

Keith | 20-May-2008 11:13pm est | http://keithdevens.com/ | #10681
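
When a document fails a check like this, it helps to know where it fails. The sketch below is one way to find the first offending byte in Python. It relies on Python's strict UTF-8 decoder rather than the regular expression, and `first_invalid_utf8_offset` and the sample XML fragment are made up for illustration.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder could not accept.
        return err.start
    return None


# Hypothetical XML fragment containing one non-ASCII character.
sample = "<name>caf\u00e9</name>"
as_utf8 = sample.encode("utf-8")      # the accented letter becomes 0xC3 0xA9
as_latin1 = sample.encode("latin-1")  # the accented letter becomes a lone 0xE9
```

Here `first_invalid_utf8_offset(as_utf8)` returns None, while the Latin-1 bytes yield the offset of the 0xE9 byte, which is the usual sign that the data was never UTF-8 in the first place.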

Al wrote:

"Confalted" -- cool learned a new word :)
Actually came for the regex (very useful thanks!)

∴ Al | 25-Jun-2009 2:29pm est | #11191
