Keith Devens .com

← My site is now fully unicode-ized and xhtml-ized | Size does matter! Shorter is better :) →

Saturday, May 15, 2004

Unicode testing

I figured it'd be fun to paste in some foreign-language text and see how my site handles it now that I do Unicode :)

I got the following texts from this helpful Unicode test page. I really wanted to find the Shema somewhere it wasn't shown as an image, but I wasn't able to.

По оживлённым берегам
Громады стройные теснятся
Дворцов и башен; корабли
Толпой со всех концов земли
К богатым пристаням стремятся;

Ἰοὺ ἰού· τὰ πάντʼ ἂν ἐξήκοι σαφῆ.
Ὦ φῶς, τελευταῖόν σε προσϐλέψαιμι νῦν,
ὅστις πέφασμαι φύς τʼ ἀφʼ ὧν οὐ χρῆν, ξὺν οἷς τʼ
οὐ χρῆν ὁμιλῶν, οὕς τέ μʼ οὐκ ἔδει κτανών.

पशुपतिरपि तान्यहानि कृच्छ्राद्
अगमयदद्रिसुतासमागमोत्कः ।
कमपरमवशं न विप्रकुर्युर्
विभुमपि तं यदमी स्पृशन्ति भावाः ॥

This is supposedly Chinese, but my browser doesn't have the fonts installed so I get a bunch of question marks:

子曰:「學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?
人不知而不慍,不亦君子乎?」

有子曰:「其為人也孝弟,而好犯上者,鮮矣;
不好犯上,而好作亂者,未之有也。君子務本,本立而道生。
孝弟也者,其為仁之本與!」

ஸ்றீனிவாஸ ராமானுஜன் ஐயங்கார்

بِسْمِ ٱللّٰهِ ٱلرَّحْمـَبنِ ٱلرَّحِيمِ

ٱلْحَمْدُ لِلّٰهِ رَبِّ ٱلْعَالَمِينَ

ٱلرَّحْمـَبنِ ٱلرَّحِيمِ

مَـالِكِ يَوْمِ ٱلدِّينِ

إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ

ٱهْدِنَــــا ٱلصِّرَاطَ ٱلمُسْتَقِيمَ

صِرَاطَ ٱلَّذِينَ أَنعَمْتَ عَلَيهِمْ غَيرِ ٱلمَغضُوبِ عَلَيهِمْ وَلاَ ٱلضَّالِّين

Here's a test of some punctuation:

The convention in English is “to use double quotation marks to indicate quotation, and ‘single quotation marks’ for nested quotations.”

En français la convention est « d'utiliser les guillemets français doubles pour les citations, et “ les guillemets anglais doubles ” ou bien ‹ les guillemets français simples › pour les citations imbriquées. »

Auf Deutsch ist die Vereinbarung »umgekehrte zweifache Anführungszeichen für die Zitate zu benutzen, sogar ›einfache Anführungszeichen‹ für die verschachtelte Zitate«; diese Anführungszeichen „dürfen auch solche ‚englische‘ Anführungszeichen sein.“

The en-dash is used between numbers such as in: 1685–1750 (J. S. Bach). It is longer than the hyphen (as in “en-dash”, or, more properly, “en‐dash”) but shorter than the em-dash, which is used — like this — as a sort of parenthesis. Neither should be confused with the horizontal bar which is used to introduce quotation in some cases.
― Like this?
― Right.

The ellipsis is… well, it just is.
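
The marks above are easy to mix up by eye, so here is a minimal Python sketch that prints the code point and official Unicode name of each one. The list `PUNCTUATION` and the helper `describe` are names made up for this illustration; the only library call is the standard `unicodedata.name`.

```python
import unicodedata

# Hypothetical list for illustration: the dash-like marks and the ellipsis
# discussed above, written as escapes so they cannot be confused on screen.
PUNCTUATION = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2015", "\u2026"]


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in PUNCTUATION:
    print(describe(ch))
```

The plain keyboard hyphen turns out to be a different character (HYPHEN-MINUS) from the typographic HYPHEN, which is the distinction the "more properly" remark above is getting at.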

It'll be interesting to see if my StructuredText parser dies on all of this.

Other than that, one concern I have is that someone could post invalid Unicode data to my site and have it accepted as-is. I wonder whether any environments other than PHP ensure that incoming text is in the proper encoding.
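
For what it's worth, here is a rough sketch of what such a check could look like, in Python rather than PHP. It is not what this site does; the function name `decode_utf8_or_none` is invented for the example, and it assumes the posted data arrives as raw bytes that are supposed to be UTF-8. It leans on Python's strict UTF-8 decoder, which raises `UnicodeDecodeError` on malformed input.

```python
from typing import Optional


def decode_utf8_or_none(raw: bytes) -> Optional[str]:
    """Return the decoded text if raw is well-formed UTF-8, otherwise None."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# Well-formed input round-trips unchanged.
good = "Ἰοὺ ἰού".encode("utf-8")

# 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
bad = b"\xc3\x28"

# An overlong encoding of "/", which a strict decoder must reject.
overlong = b"\xc0\xaf"

for sample in (good, bad, overlong):
    text = decode_utf8_or_none(sample)
    print("accepted" if text is not None else "rejected", sample)
```

Rejecting bad input outright, instead of silently replacing the broken bytes, keeps invalid data out of the database in the first place.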

Ok, here goes. Will I corrupt MySQL? Will I make PHP crash? Tune in next time, same bat time, same bat channel.


Ok then. Everything seems to have worked (though of course I still can't verify the Chinese), except that the Arabic is not right-justified. I wonder what I have to do to fix that. The page I got this from began its Arabic blockquote with the following: <blockquote xml:lang="ar" lang="ar" dir="rtl">. Language and direction are two things I guess I should shoehorn into my StructuredText parser. Interestingly, that page's encoding was ASCII, and it used entities for all of the other languages, so the page itself didn't actually use Unicode at all.
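
Here is a minimal Python sketch of both ideas: emitting a blockquote that carries language and direction attributes like the one quoted above, and re-encoding markup as pure ASCII with numeric character references. The function names `blockquote` and `ascii_with_entities` are hypothetical, and none of this is how my StructuredText parser works; it only illustrates the markup. The `xmlcharrefreplace` error handler and `html.escape` are standard Python.

```python
from html import escape


def blockquote(text: str, lang: str, rtl: bool = False) -> str:
    """Wrap text in a blockquote carrying language and direction attributes."""
    direction = "rtl" if rtl else "ltr"
    lang_attr = escape(lang)  # escapes quotes too, so the attribute stays well-formed
    return (
        f'<blockquote xml:lang="{lang_attr}" lang="{lang_attr}" dir="{direction}">'
        f"{escape(text)}</blockquote>"
    )


def ascii_with_entities(markup: str) -> str:
    """Re-encode markup as ASCII, turning other characters into numeric references."""
    return markup.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


arabic = blockquote("ٱلْحَمْدُ لِلّٰهِ رَبِّ ٱلْعَالَمِينَ", "ar", rtl=True)
print(arabic)
print(ascii_with_entities(arabic))
```

The second line of output contains only ASCII characters, which is presumably how that test page could declare an ASCII encoding and still display every script.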


Comments

Adam V. wrote:

I see Asian characters, not question marks. Though of course I can't verify that they are in fact Chinese characters.

∴ Adam V. | 15-May-2004 1:22pm est | #4586

Keith (http://keithdevens.com/) wrote:

Cool, thanks.

Keith | 15-May-2004 2:18pm est | http://keithdevens.com/ | #4587

Adam V. wrote:

If you have an MS Office CD set handy, it may have the "Arial Unicode MS" font on it. This font used to be free for download, but apparently it has been removed. It's a rather large font (23MB!), but worth tracking down if you're dealing with Unicode issues.

Some combination of Firefox and Windows chose to render this entry using that font, so I can see all the languages above.

∴ Adam V. | 15-May-2004 2:33pm est | #4588

Keith (http://keithdevens.com/) wrote:

I just tested under IE6. It has boxes all over the place for many of the different languages. Mozilla deals with Unicode better.

Keith | 15-May-2004 4:02pm est | http://keithdevens.com/ | #4589

209.114.245.216 wrote:

The Chinese text is, in fact, Chinese. Some sort of classical poem, I'm guessing. It's actually traditional Chinese characters, so I'm going to paste some simplified characters into this comment:

这达标、那达标,
都要农民掏腰包;

这大办、那大办,
都是农民血和汗。

∴ 209.114.245.216 | 15-May-2004 5:52pm est | #4590

Keith Gaughan (http://talideon.com/) wrote:

Some nitpicking: you're not supposed to have spaces on each side of an em-dash.

Bit annoyed today because my Compilers final didn't go as well as it should have because my lecturer (who set the paper) is an idiot. Grrr! :(

∴ Keith Gaughan | 17-May-2004 10:13am est | http://talideon.com/ | #4595

Randy Charles Morin (http://www.kbcafe.com) wrote:

∴ Randy Charles Morin | 1-Jun-2004 3:21pm est | http://www.kbcafe.com | #4716

Keith (http://keithdevens.com/) wrote:

What the heck? I thought I followed their guidelines precisely, but I made a mistake. Thanks for pointing that out. Now my page validates.

Keith | 4-Jun-2004 7:36pm est | http://keithdevens.com/ | #4731

Gerardo (http://ase-usa.net) wrote:

Your test of Chinese characters works fine in my browser. I'm having a particularly hard time with a project where I have to use XSLT, an XML document, and an XSL style sheet to display a language translation. Do you know of any specific declarations or markup that have to be added to the XML or XSL for this to work? Is your page an XSLT translation? Many thanks, anyone!

∴ Gerardo | 2-Jul-2004 2:55pm est | http://ase-usa.net | #4900

Keith (http://keithdevens.com/) wrote:

> Do you know of any specific declarations or markup that have to be added to the XML or XSL for this to work? Is your page an XSLT translation? Many thanks, anyone!

My page is not an XSLT translation. The only thing I can recommend to you is that you make sure all of your text is in Unicode.

Keith | 3-Jul-2004 6:03am est | http://keithdevens.com/ | #4904

Gerardo (http://ase-usa.net) wrote:

Thanks for your response. I went away on vacation last week (yeah -- it was too short). I just read my previous post. I meant to say transform instead of translate. I am using the .Net framework's XslTransform class' Transform method. I'm not even sure if you are using .Net for your blog. It's hard to find help on encoding! Thanks again.

Gerardo

∴ Gerardo | 12-Jul-2004 10:31am est | http://ase-usa.net | #4985

John DOe wrote:

这达标、那达标,

都要农民掏腰包;

这大办、那大办,

都是农民血和汗。

Just testing...

∴ John DOe | 3-Apr-2009 11:18pm est | #11105


