Keith Devens .com
Wednesday, May 21, 2008

Anne (http://annevankesteren.nl/) wrote:
Keith (http://keithdevens.com/) wrote:
When I hover over it in Firefox, the correct Hebrew is displayed in my status bar. Also, if I put the URI in my HTML without percent-encoding it, Firefox doesn't complain about invalid UTF-8 (which it would if the bytes were invalid, since I'm using XHTML). Assuming Firefox handles the Unicode for the Hebrew text correctly, there's no way it should be incorrect, since it goes straight from Firefox to my db.
When I said I don't know how to check, I meant that I'm not familiar enough with the Hebrew part of Unicode to know what the UTF-8 should be. But as further evidence that it's valid, I ran it through this regex and it said it was valid.
What makes you think it's invalid?
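The regex Keith refers to is not reproduced here. As a rough stand-in, the same kind of check can be sketched by percent-decoding a URI component and attempting a strict UTF-8 decode. The function name is hypothetical, and the sample inputs are only illustrations.

```python
from urllib.parse import unquote_to_bytes


def is_valid_utf8_component(component: str) -> bool:
    """Return True if the percent-decoded octets are well-formed UTF-8.

    Hypothetical helper: a substitute for the regex mentioned in the
    comment, not the regex itself.
    """
    raw = unquote_to_bytes(component)
    try:
        # Strict decoding rejects truncated and overlong sequences.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Two octets that form one Hebrew letter in UTF-8.
print(is_valid_utf8_component("%D7%A9"))
# A lone 0xEB octet (Latin-1 "ë") is not a complete UTF-8 sequence.
print(is_valid_utf8_component("%EB"))
```

A strict decode only confirms that the octets are well formed. It cannot tell whether they spell the text the author intended.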
Anne (http://annevankesteren.nl/) wrote:
I get "שְׁמַ" as result which seems to be too short. Also, the URI itself seems to be too short as you need to have about 6 characters to express "ë". See http://www.w3.org/2003/06/mod_fileiri/#Testing
(Note also that you should probably enable UTF-8 for URIs in Firefox. That is on by default in recent nightlies.)
It looks more like a legacy encoding than UTF-8 to me.
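The "about 6 characters" figure for "ë" can be checked with a small sketch, assuming it refers to the percent-encoded UTF-8 form. The Latin-1 variant is included only as an example of what a legacy encoding would produce.

```python
from urllib.parse import quote

# "ë" is U+00EB: two octets in UTF-8, one octet in Latin-1.
utf8_form = quote("ë", safe="")                        # percent-encoded UTF-8
latin1_form = quote("ë", safe="", encoding="latin-1")  # percent-encoded Latin-1

print(utf8_form, len(utf8_form))
print(latin1_form, len(latin1_form))
```

Each UTF-8 octet costs three characters once percent-encoded, so a two-octet character takes six characters and a legacy single-octet character takes three.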
Keith (http://keithdevens.com/) wrote:
I get "שְׁמַ" as result which seems to be too short.
Well, that's all I put. Two characters with two vowels. Though in fact the "Shema" should have included the silent third letter--שְׁמַ֖ע--so that was my mistake in only including the first two. I've never understood why Hebrew has all these silent letters!
Here's what I get if I break the URL up into sets of three octets:
%E2%80%8F %D7%A9%D7 %81%D6%B0 %D7%9E%D6 %B7%D6%96
So, that's big enough for four letters plus one. The plus one is most likely that the Hebrew is encoded such that it encodes the letter that serves as the base for both "sin" (שׂ) and "shin" (שׁ) separately from the dot (which I'm sure has a name that I forget). That makes five characters in total, so it adds up correctly.
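As a cross-check on the grouping above, the octets can be decoded and listed one code point at a time. This is a minimal sketch that uses only the 15 octets quoted in the comment. Decoding suggests they do not fall into sets of three: the first three octets are a single right-to-left mark (U+200F), and each Hebrew code point after it takes two octets, giving seven code points in all.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

# The octets quoted in the comment, joined without the spaces.
ENCODED = "%E2%80%8F%D7%A9%D7%81%D6%B0%D7%9E%D6%B7%D6%96"

raw = unquote_to_bytes(ENCODED)
text = raw.decode("utf-8")  # raises UnicodeDecodeError if malformed

# Print each code point with its own octets and Unicode name.
for ch in text:
    octets = "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X}  {octets:<10} {name}")

code_points = [f"U+{ord(ch):04X}" for ch in text]
print(len(raw), "octets,", len(code_points), "code points")
```

The listing also names the dot: it is the code point that follows the shin, encoded separately from the base letter as the comment guessed.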
Keith (http://keithdevens.com/) wrote:
Oh, by the way Anne. Just so I'm straight, IRIs are just URIs where non-ASCII characters don't have to be percent-encoded so long as they're in UTF-8, right?
What's the status of IRIs? Is the standard done? Are we just waiting for browsers to catch up so we can start using them? How bad is browser support, currently? (Sorry to bother you with so many questions.)
Anne (http://annevankesteren.nl/) wrote:
IRIs are a standard. The IRI specification was published together with the new URI RFC. See RFC 3986 (URI) and RFC 3987 (IRI). Some browsers already support them if you use the correct configuration. Here is a simple testcase: http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html
(Recent nightlies of Mozilla show a green image there. You can get the same result in Firefox if you change some options in about:config.)
I wasn't aware that you just took a few characters of Hebrew. I thought you took the whole title, just like you are doing here.
Keith (http://keithdevens.com/) wrote:
I wasn't aware that you just took a few characters of Hebrew. I thought you took the whole title, just like you are doing here.
No problem. I'm just glad to know it's not broken.
Thanks for the info on IRIs and the link to the test case. I played around with the Firefox setting and now I understand what the trouble is. The page encoding is latin-1, but the URIs always have to be in UTF-8 regardless of page encoding. So, it seems that if your pages use UTF-8 you get IRIs "for free" regardless of what your browser does?
Anne (http://annevankesteren.nl/) wrote:
Yeah, I believe that is an advantage of UTF-8 :-)
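The point about page encoding versus URI encoding can be sketched as follows. This is a simplified illustration, not the full RFC 3987 mapping: the helper name and the example.org address are hypothetical, and non-ASCII host names (which need separate handling) are ignored.

```python
from urllib.parse import quote

# ASCII characters that already have meaning in a URI and are left alone,
# including "%" so that existing escapes are not double-encoded.
URI_SAFE = "/:?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 octets (simplified)."""
    return quote(iri, safe=URI_SAFE, encoding="utf-8")


# U+05E9 U+05DE U+05E2 spelled with escapes to keep the source ASCII-only.
print(iri_to_uri("http://example.org/\u05e9\u05de\u05e2"))

# The same word percent-encoded from two different byte encodings:
print(quote("résumé", encoding="utf-8"))
print(quote("résumé", encoding="latin-1"))
```

If a browser builds the request from the page's own bytes, a latin-1 page yields the second form while a UTF-8 page yields the first, which is the form an IRI maps to.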
You can see it quite easily if you hover over the link "Anne claimed" in Firefox. You could also pass the hexadecimal-encoded characters to something that you know uses UTF-8, like Google, and see what the result is. This URI seems to be compliant with the IRI specification, though.