Keith Devens .com
Saturday, October 11, 2008

Jon Hanna (http://www.hackcraft.net/) wrote:
Keith (http://keithdevens.com/) wrote:
If the 16bit encoding you refer to is something else...
I just meant something like what Java and C# use natively to store Unicode strings.
You aren't guaranteed to be able to do that with C++ (nor, I understand, with PHP).
Why not? PHP strings and C++ strings are 8-bit clean. Obviously asking the language for the length of the string won't return the correct number of characters unless the string contains only ASCII characters, but would you recommend against storing the raw UTF-8 data in a string like that for other reasons?
Jon Hanna (http://www.hackcraft.net/) wrote:
I just meant something like what Java and C# use natively to store Unicode strings.
Grand so, few worries there.
Why not? PHP strings and C++ strings are 8-bit clean.
Yes, you can put UTF-8 in them, and you will generally be safe with them. Surprises can arise when you use a function that isn't expecting UTF-8: strlen(), for example, will return the number of code units rather than the number of code points, as you say. But if you're hip to the possibility of stuff like that then you can be quite safe in storing stuff in UTF-8.
That said, I find UTF-16 is easier to work with than UTF-8 in C++. YMMV.
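To make the code-unit versus code-point difference concrete, here is a minimal C++ sketch. The function name utf8_code_points is made up for illustration, and it assumes the string already holds well-formed UTF-8 (it does no validation). It counts every byte that is not a continuation byte (10xxxxxx), which gives the number of code points, while size() on the same string still reports bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from any library): counts code points in a
// std::string that is assumed to contain well-formed UTF-8.
// Every code point has exactly one non-continuation byte, so counting
// the bytes that do NOT match 10xxxxxx counts the code points.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {  // not a continuation byte
            ++count;
        }
    }
    return count;
}
```

For a string holding "café" as UTF-8, size() (like strlen()) sees 5 code units, while this helper sees 4 code points. Note that code points are still not the same thing as user-perceived characters once combining marks are involved.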

You aren't guaranteed to be able to do that with C++ (nor, I understand, with PHP). Really, it depends on the application (if it doesn't say it'll handle UTF-8 in this way then it probably won't).
If the application uses UTF-16 or UTF-32 then the code to transcode from UTF-8 is simple and can work efficiently on a streaming basis. If it's UCS-2 it's easy enough (you just need to work out what you're going to do if the UTF-8 contains a character UCS-2 doesn't contain). If the 16bit encoding you refer to is something else then the complexity will vary according to the encoding in question.
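As a rough illustration of transcoding on a streaming basis, here is a minimal C++ sketch of an incremental UTF-8 to UTF-16 converter. The class name Utf8ToUtf16 and its feed() method are invented for this example. It accepts one byte at a time, so the input can arrive in arbitrary chunks. As a simplifying assumption, malformed input is replaced with U+FFFD, and overlong forms and UTF-8-encoded surrogates are not rejected, so treat it as a sketch rather than a validating decoder.

```cpp
#include <cstdint>
#include <string>

// Hypothetical incremental transcoder: feed UTF-8 bytes one at a time,
// completed code points are appended to `out` as UTF-16 code units.
class Utf8ToUtf16 {
public:
    void feed(unsigned char byte, std::u16string& out) {
        if (remaining_ == 0) {
            // Expecting a lead byte (or a plain ASCII byte).
            if (byte < 0x80) {
                emit(byte, out);
            } else if ((byte & 0xE0) == 0xC0) {
                cp_ = byte & 0x1F; remaining_ = 1;
            } else if ((byte & 0xF0) == 0xE0) {
                cp_ = byte & 0x0F; remaining_ = 2;
            } else if ((byte & 0xF8) == 0xF0) {
                cp_ = byte & 0x07; remaining_ = 3;
            } else {
                emit(0xFFFD, out);  // stray continuation or invalid lead byte
            }
        } else if ((byte & 0xC0) == 0x80) {
            // Continuation byte: shift in six more bits.
            cp_ = (cp_ << 6) | (byte & 0x3F);
            if (--remaining_ == 0) {
                emit(cp_, out);
            }
        } else {
            // Sequence was cut short: substitute, then retry this byte as a lead.
            remaining_ = 0;
            emit(0xFFFD, out);
            feed(byte, out);
        }
    }

private:
    static void emit(std::uint32_t cp, std::u16string& out) {
        if (cp > 0x10FFFF) {
            cp = 0xFFFD;  // outside the Unicode range
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }

    std::uint32_t cp_ = 0;  // code point being assembled
    int remaining_ = 0;     // continuation bytes still expected
};
```

The only state carried between calls is the partial code point and a count of bytes still expected, which is why this works on a stream. For a UCS-2 target, the surrogate-pair branch in emit() is where you would have to decide what to do instead (substitute U+FFFD, report an error, and so on), since UCS-2 has no way to represent those characters.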