
Keith Devens .com

Wednesday, May 21, 2008 Flag waving
Whereof one cannot speak, thereof one must be silent. – Ludwig Wittgenstein (Tractatus Logico-Philosophicus)
← Transcript of Scalia's speech on Constitutional interpretationGrady Booch's keynote on software complexity at AOSD | Lambda the Ultimate →

Wednesday, March 16, 2005

Iñtërnâtiônàlizætiøn

This is a test to see if my URIs are encoded correctly.

Anne claimed that a URI I had in Hebrew wasn't properly encoded (in UTF-8), but I'm not even sure how to check. As far as I know I'm doing all my Unicode handling correctly (in UTF-8), but ultimately I rely on my browser to give me the right UTF-8. Whatever I submit with my browser is what gets stuck in my database and ultimately winds up as a URI.

Sam has a URI-encoded version of 'Iñtërnâtiônàlizætiøn' that I can check against to see if it gets URI-ified correctly, so this is a test to see how my thing works... and it worked. I'm not sure what's up with the Hebrew if Anne is right.
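One way to check a result like this without relying on a browser is to compute the percent-encoding directly. A minimal sketch in Python, assuming the standard library's urllib.parse.quote and that the title is encoded as UTF-8 before escaping (the variable names are illustrative and not part of this site's code):

```python
from urllib.parse import quote, unquote

title = "Iñtërnâtiônàlizætiøn"

# quote() encodes the string as UTF-8 by default, then percent-escapes
# every byte that is not an unreserved ASCII character.
encoded = quote(title)
print(encoded)

# Round trip: decoding the escapes as UTF-8 should return the original title.
assert unquote(encoded) == title
```

Each accented letter in the title should come out as two escaped octets, which is what you would compare against a known-good reference string.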


Comments

Anne (http://annevankesteren.nl/) wrote:

You can see it quite easily if you hover over the link "Anne claimed" in Firefox. You could also pass the hexadecimal-encoded characters to something you know uses UTF-8, like Google, and see what the result is. This URI seems to be compliant with the IRI specification, though.

∴ Anne | 17-Mar-2005 2:22am est | http://annevankesteren.nl/ | #7231

Keith (http://keithdevens.com/) wrote:

When I hover over it in Firefox the correct Hebrew is displayed in my status bar. Also, if I have the URI in my HTML without doing the URI percent-encoding, Firefox doesn't complain about invalid UTF-8 (which it would if it were invalid, since I'm using XHTML). Assuming Firefox handles the Unicode for the Hebrew text correctly, there's no way it should be incorrect, since it goes right from Firefox to my db.

When I said I don't know how to check, I meant that I'm not familiar enough with the Hebrew part of Unicode to know what the UTF-8 should be. But as further evidence that it's valid, I ran it through this regex and it said it was valid.

What makes you think it's invalid?

Keith | 17-Mar-2005 3:13am est | http://keithdevens.com/ | #7232
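A validity check like the one Keith describes can also be done without a regex, by undoing the percent-escapes and attempting a strict UTF-8 decode. The sketch below is an assumption about how one might do this in Python, not the regex referred to above; the function name is_valid_utf8_uri is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def is_valid_utf8_uri(uri: str) -> bool:
    """Return True if the percent-escaped octets in `uri` decode as UTF-8."""
    raw = unquote_to_bytes(uri)  # turn %XX escapes back into raw bytes
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# Two Hebrew code points, two octets each: well-formed UTF-8.
shin_ok = is_valid_utf8_uri("%D7%A9%D7%81")
# A lone 0xEB is "ë" in Latin-1 but an incomplete sequence in UTF-8.
latin1_ok = is_valid_utf8_uri("%EB")
print(shin_ok, latin1_ok)
```

Passing this check only shows that the octets are well-formed UTF-8. It does not prove that they spell the text the author intended.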

Anne (http://annevankesteren.nl/) wrote:

I get "‏שְׁמַ" as the result, which seems to be too short. Also, the URI itself seems too short, as you need about 6 characters to express "ë". See http://www.w3.org/2003/06/mod_fileiri/#Testing

(Note also that you should probably enable UTF-8 for URIs in Firefox. That is on by default in recent nightlies.)

It looks more like legacy encoding than UTF-8 to me.

∴ Anne | 17-Mar-2005 3:22am est | http://annevankesteren.nl/ | #7233

Keith (http://keithdevens.com/) wrote:

> I get "‏שְׁמַ" as result which seems to be too short.

Well, that's all I put. Two characters with two vowels. Though in fact the "Shema" should have included the silent third letter--‏שְׁמַ֖ע--so that was my mistake in only including the first two. I've never understood why Hebrew has all these silent letters!

Here's what I get if I break the URL up into sets of three octets:

%E2%80%8F %D7%A9%D7 %81%D6%B0 %D7%9E%D6 %B7%D6%96

So, that's big enough for four letters plus one. The plus one is most likely that the Hebrew is encoded such that it encodes the letter that serves as the base for both "sin" (שׂ) and "shin" (‏‏שׁ) separately from the dot (which I'm sure has a name that I forget). That makes five characters in total, so it adds up correctly.

Keith | 17-Mar-2005 3:39am est | http://keithdevens.com/ | #7234
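Grouping the escapes in threes does not line up with UTF-8 character boundaries. Hebrew letters and points (U+0590 to U+05FF) take two octets each, and only the leading %E2%80%8F is a three-octet sequence. A short sketch that decodes the same octets and lists each code point, assuming Python's standard urllib.parse and unicodedata modules:

```python
import unicodedata
from urllib.parse import unquote_to_bytes

# The escaped octets quoted in the comment above, with the spaces removed.
escaped = "%E2%80%8F%D7%A9%D7%81%D6%B0%D7%9E%D6%B7%D6%96"

raw = unquote_to_bytes(escaped)
text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8

code_points = [f"U+{ord(ch):04X}" for ch in text]

# Print each code point, how many UTF-8 octets it uses, and its Unicode name.
for ch in text:
    octets = len(ch.encode("utf-8"))
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X}  {octets} octets  {name}")
```

Decoded this way, the 15 octets give seven code points rather than five: a right-to-left mark, the letter shin, the shin dot, a sheva, the letter mem, a patah, and a cantillation accent. The URI is still valid UTF-8, so the conclusion that nothing is broken holds.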

Keith (http://keithdevens.com/) wrote:

Oh, by the way Anne. Just so I'm straight, IRIs are just URIs where non-ASCII characters don't have to be percent-encoded so long as they're in UTF-8, right?

What's the status of IRIs? Is the standard done? Are we just waiting for browsers to catch up so we can start using them? How bad is browser support, currently? (Sorry to bother you with so many questions.)

Keith | 17-Mar-2005 3:49am est | http://keithdevens.com/ | #7235

Anne (http://annevankesteren.nl/) wrote:

IRIs are a standard. They were published together with the new URI RFC. See RFC 3986 (URI) and RFC 3987 (IRI). Some browsers already support them if you use the correct configuration. Here is a simple testcase: http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html

(Recent nightlies of Mozilla show a green image there. You can get the same result in Firefox if you change some options in about:config.)

I wasn't aware that you just took a few characters of Hebrew. I thought you took the whole title, just like you are doing here.

∴ Anne | 17-Mar-2005 6:53am est | http://annevankesteren.nl/ | #7237

Keith (http://keithdevens.com/) wrote:

> I wasn't aware that you just took a few characters of Hebrew. I thought you took the whole title, just like you are doing here.

No problem. I'm just glad to know it's not broken.

Thanks for the info on IRIs and the link to the test case. I played around with the Firefox setting and now I understand what the trouble is. The page encoding is latin-1, but the URIs always have to be in UTF-8 regardless of page encoding. So, it seems that if your pages use UTF-8 you get IRIs "for free" regardless of what your browser does?

Keith | 17-Mar-2005 2:29pm est | http://keithdevens.com/ | #7238
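The difference between the page encoding and the URI encoding can be seen by escaping the same character both ways. This is a rough sketch, assuming Python's urllib.parse.quote. The helper iri_path_to_uri_path is a hypothetical name, and it covers only the path portion: the full IRI-to-URI mapping in RFC 3987 also deals with host names and other components.

```python
from urllib.parse import quote

ch = "ë"

# What a UTF-8 page (or an IRI-aware browser) should send.
as_utf8 = quote(ch, encoding="utf-8")
# What a browser might send if it escapes using a Latin-1 page encoding.
as_latin1 = quote(ch, encoding="latin-1")
print(as_utf8, as_latin1)


def iri_path_to_uri_path(path: str) -> str:
    """Percent-escape the non-ASCII characters in a path as UTF-8 octets."""
    return quote(path, safe="/", encoding="utf-8")


uri_path = iri_path_to_uri_path("/résumé.html")
print(uri_path)
```

The UTF-8 form of "ë" is the six-character escape Anne mentions, while the Latin-1 form is a single three-character escape. That makes a legacy-encoded URI noticeably shorter than its UTF-8 equivalent.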

Anne (http://annevankesteren.nl/) wrote:

Yeah, I believe that is an advantage of UTF-8 :-)

∴ Anne | 17-Mar-2005 5:34pm est | http://annevankesteren.nl/ | #7239




