<?xml version="1.0" ?>
<rss version="2.0">
	<channel>
		<title>Keith's Weblog: Comments on &quot;A regular expression to check for valid UTF-8&quot;</title>
		<description>Keith's Weblog: Comments on &quot;A regular expression to check for valid UTF-8&quot;, posted on June 29, 2004</description>
		<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex</link>

		<category>Programming</category>
		<language>en-us</language>
		<image>
			<link>http://keithdevens.com/weblog</link>
			<title>Keith Devens .com</title>
			<url>http://keithdevens.com/images/kbd.gif</url>
		</image>

		<item>
			<title>by Aidan Kehoe</title>
			<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex#comment4893</link>
			<guid isPermaLink="false">http://keithdevens.com/weblog/4982#comment4893</guid>
			<pubDate>Wed, 30 Jun 2004 13:02:37 +0000</pubDate>
			<description>&lt;code&gt;if&amp;#160;(@iconv('UTF-8',&amp;#160;'UTF-8',&amp;#160;$input)&amp;#160;===&amp;#160;$input)&amp;#160;echo&amp;#160;&amp;quot;Good&amp;#160;UTF-8!&amp;quot;;&lt;br /&gt;
else&amp;#160;echo&amp;#160;&amp;quot;Nope.&amp;quot;;&lt;br /&gt;
&lt;/code&gt;is much, much faster, if iconv() is available.</description>
		</item>
		<item>
			<title>by Micah</title>
			<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex#comment4894</link>
			<guid isPermaLink="false">http://keithdevens.com/weblog/4982#comment4894</guid>
			<pubDate>Wed, 30 Jun 2004 15:34:12 +0000</pubDate>
			<description>&lt;p class=&quot;st-markup&quot;&gt;More regexps for Asian character-set encodings in this paper:&lt;/p&gt;

&lt;p class=&quot;st-markup&quot;&gt;&lt;a href=&quot;http://examples.oreilly.com/cjkvinfo/perl/svpm99-paper.pdf&quot;&gt;http://examples.oreilly.com/cjkvinfo/perl/svpm99-paper.pdf&lt;/a&gt;&lt;/p&gt;

</description>
		</item>
		<item>
			<title>by Rodent of Unusual Size</title>
			<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex#comment5302</link>
			<guid isPermaLink="false">http://keithdevens.com/weblog/4982#comment5302</guid>
			<pubDate>Thu, 19 Aug 2004 15:59:47 +0000</pubDate>
			<description>&lt;p class=&quot;st-markup&quot;&gt;actually, after looking at &lt;a href=&quot;http://www.faqs.org/rfcs/rfc3629.html&quot;&gt;rfc3629&lt;/a&gt; and talking to sam ruby, it appears that this regex is, in fact, &lt;strong&gt;not&lt;/strong&gt; valid.  the rfc specifies that octets 0x00-0x7F are valid, being identical to their ASCII counterparts, but the regex excludes much of that range.  it looks as though the utf-8 definition was conflated with the &lt;a href=&quot;http://www.w3.org/TR/REC-xml/#charsets&quot;&gt;xml 1.0&lt;/a&gt; character list to the detriment of the stated purpose ('is this valid utf-8?' determination).&lt;/p&gt;

</description>
		</item>
		<item>
			<title>by Keith</title>
			<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex#comment5303</link>
			<guid isPermaLink="false">http://keithdevens.com/weblog/4982#comment5303</guid>
			<pubDate>Thu, 19 Aug 2004 16:20:00 +0000</pubDate>
			<description>&lt;p class=&quot;st-markup&quot;&gt;Thanks for pointing that out, Ken! I'd wondered about that myself but didn't think too much about it. You're probably right that they conflated the XML 1.0 character definition with the definition of UTF-8.&lt;/p&gt;

</description>
		</item>
		<item>
			<title>by Andy</title>
			<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex#comment10680</link>
			<guid isPermaLink="false">http://keithdevens.com/weblog/4982#comment10680</guid>
			<pubDate>Tue, 20 May 2008 21:36:41 +0000</pubDate>
			<description>&lt;p class=&quot;st-markup&quot;&gt;Quick question, if someone can tell me: is Â valid UTF-8, or must it be &amp;amp;#160;? I'm trying to insert XML (unencoded) into a DB that was reportedly set for UTF-8. It is failing, and I am trying to understand whether I need to encode it or the DBA should fix the DB.&lt;/p&gt;

</description>
		</item>
		<item>
			<title>by Keith</title>
			<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex#comment10681</link>
			<guid isPermaLink="false">http://keithdevens.com/weblog/4982#comment10681</guid>
			<pubDate>Wed, 21 May 2008 04:13:06 +0000</pubDate>
			<description>&lt;p class=&quot;st-markup&quot;&gt;Run the above regular expression against your XML to see if the Â is actually encoded in UTF-8 or whether it's Latin-1 or something. If the db is supposed to accept UTF-8 and your XML passes the regex, then the db is wrong. More likely, your XML isn't actually encoded in UTF-8.&lt;/p&gt;

</description>
		</item>
		<item>
			<title>by Al</title>
			<link>http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex#comment11191</link>
			<guid isPermaLink="false">http://keithdevens.com/weblog/4982#comment11191</guid>
			<pubDate>Thu, 25 Jun 2009 19:29:22 +0000</pubDate>
			<description>&lt;p class=&quot;st-markup&quot;&gt;&amp;quot;Conflated&amp;quot; -- cool, learned a new word &lt;img class=&quot;smiley&quot; src=&quot;/images/smiley_side.gif&quot; alt=&quot;Smiley&quot; /&gt;&lt;br /&gt;
Actually came for the regex (very useful, thanks!)&lt;/p&gt;

</description>
		</item>
	</channel>
</rss>
