KBD

Keith Devens .com


Monday, August 11, 2003

Java is obnoxious - or - Java character encodings considered harmful?

This guy goes into the exact problem I encountered with my Java code the other day. Unlike that guy, I realized what was going on, but I still don't know of any good solutions. I know of solutions, just not good ones.

Here's the problem. I'm working on a library for a file format that can store binary data as well as character data. In most languages, the distinction doesn't really matter, and I was able to store everything (including binary data) as strings without the language complaining. I'm intentionally trying to ignore Unicode for now since I honestly don't understand the issues deeply enough, but I do know that if I can store exactly what I got in then whatever application uses the library can get Unicode strings out of it.

But Java, always the problem child, forces all of its strings to be in a given character encoding, and translates whatever characters it doesn't understand into the dreaded 3F, or the ASCII question mark (?). I figured if I used an 8-bit encoding like ISO-8859-1 then Java should be able to deal with any 8 bits I give it. But instead of trusting the programmer and spitting out whatever bytes I give it, it actually goes through the trouble of translating characters it doesn't understand into question marks. So Java strings are completely useless for storing binary data.
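A minimal sketch of the lossy round trip described above. It assumes the bytes pass through US-ASCII, a charset picked here only because it cannot represent bytes above 0x7F; the post does not record which charset was actually in effect. `CharsetRoundTrip` and `roundTrip` are hypothetical names. For comparison, the sketch also runs the same bytes through an explicitly named ISO-8859-1, which on current JDKs hands them back unchanged.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // Decode raw bytes into a String with the given charset, then encode
    // the String back to bytes with the same charset.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Illustrative "binary" data: two ASCII-range bytes, two high bytes.
        byte[] binary = {0x00, 0x41, (byte) 0x80, (byte) 0xFF};

        // US-ASCII cannot represent 0x80 or 0xFF, so each one is replaced
        // while decoding and comes back out as 0x3F ('?').
        System.out.println(Arrays.toString(roundTrip(binary, StandardCharsets.US_ASCII)));

        // ISO-8859-1 maps every byte value to a character and back.
        System.out.println(Arrays.toString(roundTrip(binary, StandardCharsets.ISO_8859_1)));
    }
}
```

If the second line of output matches the input on your JDK, the question marks are probably coming from a different charset somewhere along the path, such as the platform default used when none is named.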

This means that while in every other language I was just able to store binary data inside a string - which meant that I could use normal input and output functions, as well as the automatic memory management built-in - in Java I'm going to have to store my data as byte arrays, do my own memory management, and I or any other user of the library will probably have to worry about converting back and forth between bytes and characters. Yet again, Java makes my life harder than it should be.
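For what it's worth, here is a rough sketch of the byte-array approach, not the library's actual code. `RawField` is a hypothetical name, and the idea is only an assumption about how such a library might look: keep the data as bytes, let `ByteArrayOutputStream` do the buffer growing, and convert to a `String` only when the caller names a charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical container for one field of the file format.
class RawField {
    // Grows on demand, so there is no manual buffer resizing.
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();

    // Append a chunk exactly as received; no charset is involved.
    void append(byte[] chunk) {
        buf.write(chunk, 0, chunk.length);
    }

    // Hand back a copy of the stored bytes, untouched.
    byte[] bytes() {
        return buf.toByteArray();
    }

    // Decode only when the caller asks, with a charset the caller chooses.
    String text(Charset cs) {
        return new String(buf.toByteArray(), cs);
    }

    public static void main(String[] args) {
        RawField field = new RawField();
        field.append("caf\u00e9".getBytes(StandardCharsets.UTF_8));
        System.out.println(field.bytes().length);                // 5 bytes for 4 characters
        System.out.println(field.text(StandardCharsets.UTF_8));
    }
}
```

The conversion still exists, but it happens in one place and only for callers who actually want text.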


Comments

Adam Langley (http://www.imperialviolet.org) wrote:

The problem is that Java's concept of a string is an array of characters and it stores them internally as Unicode. It then translates them to a given encoding as they go in and out.
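A small sketch of that translation step. `EncodingSizes` and `encodedLength` are hypothetical names; the point is only that one in-memory character can become a different number of bytes depending on the charset chosen on the way out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // a single character, e with acute accent
        System.out.println(s.length());                                    // 1 char in memory
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // 1 byte
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // 2 bytes
    }
}
```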

Personally, I would be far happier if it stored them internally as byte arrays and only used the character encoding when doing string operations on them. That way you could treat them just as binary if you set a character encoding of 'binary', and splices etc. would work on byte numbers. If you then tagged it as 'utf-8', splices would start using letter indexes instead.

(and just as a warning, Java can't handle unsigned byte arrays. The only Java programming I've ever done involved doing crypto and signed/unsigned problems convinced me never to use Java again).
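The usual workaround for the signed/unsigned issue is to mask each byte with `0xFF` before treating it as a number. A minimal sketch, with `UnsignedBytes`, `toUnsigned` and `toHex` as hypothetical helper names:

```java
class UnsignedBytes {
    // Java's byte is signed (-128..127); masking widens it to an int in 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Render bytes as two-digit upper-case hex, masking so that negative
    // byte values are not sign-extended.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;
        System.out.println(b);             // -1, the signed view
        System.out.println(toUnsigned(b)); // 255, the unsigned view
        System.out.println(toHex(new byte[]{0x3F, (byte) 0x80, 0x0A})); // 3F800A
    }
}
```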

AGL

∴ Adam Langley | 12-Aug-2003 5:12am est | http://www.imperialviolet.org | #2682

Keith (http://www.keithdevens.com/) wrote:

Personally, I would be far happier if it stored them internally as byte arrays and only used the character encoding when doing string operations on them.

Absolutely. Perl and Python seem to just store whatever you give them in the string and don't generally make you worry about it until you want to get stuff out of it, which seems like the way to go. I do wonder, though: Java seems to be the first language to have considered Unicode from the start, while most other languages, such as Perl and Python, have it tacked on (though they seem to make me worry about it less than Java does - figure that). My point is that maybe Perl and Python would have done it more like Java had they considered Unicode from the start, but I'm not sure.

Here's how Tim Bray handled an implementation of Unicode strings for Java. See my wiki page on Unicode for more from him and otherwise. Feel free to add your own links ;)

Keith | 12-Aug-2003 10:13am est | http://www.keithdevens.com/ | #2684

Sam Newman (http://www.magpiebrain.com) wrote:

So Java strings are completely useless for storing binary data.

Well yes. It's a string, of character data. I would no more use a String for storing binary data than I would use a byte array for storing a String for display purposes. Java Strings are designed to be seen, and as such the String class and its associated classes are written and designed to make the process of using these strings for display purposes as easy as possible, so I am not surprised at all at its behaviour.

This means that while in every other language I was just able to store binary data inside a string - which meant that I could use normal input and output functions, as well as the automatic memory management built-in - in Java I'm going to have to store my data as byte arrays, do my own memory management, and I or any other user of the library will probably have to worry about converting back and forth between bytes and characters. Yet again, Java makes my life harder than it should be.

I'm unsure what you mean by memory management here. In any case, the main thrust of my comment is that you are using Strings for something other than the purpose for which they were written. If you want to store some binary data, you could use a byte[] of course, but you'd be better off using a ByteBuffer (part of the nio package) or for earlier versions just use a ByteArrayInputStream.
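A minimal sketch of the ByteBuffer suggestion. The record layout (a 4-byte big-endian length followed by the payload) and the names `RecordCodec`, `encode` and `decode` are made up for illustration; nothing in this thread describes the real file format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class RecordCodec {
    // Hypothetical layout: 4-byte big-endian length, then the raw payload.
    static byte[] encode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length); // ByteBuffer is big-endian by default
        buf.put(payload);           // bytes are copied verbatim, no charset involved
        return buf.array();
    }

    // Read the length prefix, then copy out exactly that many payload bytes.
    static byte[] decode(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int length = buf.getInt();
        byte[] payload = new byte[length];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        byte[] payload = {(byte) 0xFF, 0x00, (byte) 0x80};
        byte[] record = encode(payload);
        System.out.println(Arrays.toString(record));
        System.out.println(Arrays.equals(decode(record), payload)); // true
    }
}
```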

∴ Sam Newman | 12-Aug-2003 12:11pm est | http://www.magpiebrain.com | #2685
