
Java: Why are charset names not constants?



Charset issues are confusing and complicated by themselves, but on top of that you have to remember the exact names of your charsets. Is it "utf8"? Or "utf-8"? Or maybe "UTF-8"? When searching the internet for code samples you will see all of the above. Why not just make them named constants and use Charset.UTF8?




Comments

  1. The simple answer to the question asked is that the available charset strings vary from platform to platform.

    However, there are six that are required to be present, so constants could have been made for those long ago. I don't know why they weren't.

    JDK 1.4 did a great thing by introducing the Charset type. At that point, they wouldn't have wanted to provide String constants anymore, since the goal is to get everyone using Charset instances. So why not provide the six standard Charset constants, then? I asked Martin Buchholz, since he happens to be sitting right next to me, and he said there wasn't a particularly great reason, except that at the time things were still half-baked -- too few JDK APIs had been retrofitted to accept Charset, and of the ones that had been, the Charset overloads usually performed slightly worse.

    It's sad that it's only in JDK 1.6 that they finally finished outfitting everything with Charset overloads. And that this backwards performance situation still exists (the reason why is incredibly weird and I can't explain it, but is related to security!).

    Long story short -- just define your own constants, or use Guava's Charsets class, which Tony the Pony linked to (though that library is not actually released yet).
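
    As a rough illustration of the "define your own constants" route, a minimal sketch might look like the following. The class name MyCharsets and its field names are hypothetical, not from any library; it assumes Java 6 or later for String.getBytes(Charset).

    ```java
    import java.nio.charset.Charset;

    // Hypothetical in-house holder; the class and field names are illustrative.
    final class MyCharsets {
        // Charset.forName throws an unchecked exception for unknown names, but
        // these three are among the charsets every Java platform must support.
        static final Charset UTF_8 = Charset.forName("UTF-8");
        static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
        static final Charset US_ASCII = Charset.forName("US-ASCII");

        private MyCharsets() {
            // constants only, no instances
        }

        public static void main(String[] args) {
            // getBytes(Charset) declares no checked exception, so no try/catch.
            byte[] bytes = "caf\u00e9".getBytes(MyCharsets.UTF_8);
            System.out.println(bytes.length); // prints 5
        }
    }
    ```

    The lookup by name happens once, when the class is initialized, so callers never have to spell the charset name themselves.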

  2. Two years later, and Java 7's StandardCharsets now defines constants for the 6 standard charsets.

    If you are stuck on Java 5/6, you can use Guava's Charsets constants, as suggested by Kevin Bourrillion and Jon Skeet.
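
    For readers on Java 7 or later, a small sketch of what using those constants looks like. The class and method names here are made up for the example; only StandardCharsets and the String overloads come from the JDK.

    ```java
    import java.nio.charset.StandardCharsets;

    // Hypothetical demo class; only StandardCharsets is a JDK type here.
    final class StandardCharsetsDemo {
        // Encode and decode with the Java 7+ constant. No try/catch is needed
        // because the Charset overloads do not throw UnsupportedEncodingException.
        static String roundTrip(String text) {
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            return new String(utf8, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            System.out.println(roundTrip("na\u00efve").equals("na\u00efve")); // prints true
        }
    }
    ```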

  3. I'd argue that we can do much better than that... why aren't the guaranteed-to-be-available charsets accessible directly? Charset.UTF8 should be a reference to the Charset, not the name as a string. That way we wouldn't have to handle UnsupportedEncodingException all over the place.

    Mind you, I also think that .NET chose a better strategy by defaulting to UTF-8 everywhere. It then screwed up by naming the "operating system default" encoding property simply Encoding.Default - which isn't the default within .NET itself :(

    Back to ranting about Java's charset support - why isn't there a constructor for FileWriter/FileReader which takes a Charset? Basically those are almost useless classes due to that restriction - you almost always need an InputStreamReader around a FileInputStream or the equivalent for output :(

    Nurse, nurse - where's my medicine?

    EDIT: It occurs to me that this hasn't really answered the question. The real answer is presumably either "nobody involved thought of it" or "somebody involved thought it was a bad idea." I would strongly suggest that in-house utility classes providing the names or charsets avoid duplication around the codebase... Or you could just use the one that we use at Google.
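
    The FileReader/FileWriter workaround mentioned above, wrapping a file stream in an InputStreamReader or OutputStreamWriter that takes a Charset, could be sketched roughly like this. The class and method names are hypothetical, and the try-with-resources syntax assumes Java 7 or later.

    ```java
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.Charset;

    // Hypothetical helper class; names are illustrative only.
    final class CharsetFileIo {
        static final Charset UTF_8 = Charset.forName("UTF-8");

        // Write text through an OutputStreamWriter so the charset is explicit.
        static void writeText(File file, String text, Charset charset) throws IOException {
            try (Writer out = new OutputStreamWriter(new FileOutputStream(file), charset)) {
                out.write(text);
            }
        }

        // Read the first line through an InputStreamReader with the same charset.
        static String readFirstLine(File file, Charset charset) throws IOException {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), charset))) {
                return in.readLine();
            }
        }

        public static void main(String[] args) throws IOException {
            File tmp = File.createTempFile("charset-demo", ".txt");
            tmp.deleteOnExit();
            writeText(tmp, "gr\u00fc\u00df dich\n", UTF_8);
            System.out.println(readFirstLine(tmp, UTF_8));
        }
    }
    ```

    Because the Charset is passed as an object rather than a name, neither method has to declare or catch UnsupportedEncodingException.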

  4. The current state of the encoding API leaves something to be desired. Some parts of the Java 6 API don't accept Charset in place of a string (in logging, dom.ls, PrintStream; there may be others). It doesn't help that encodings are supposed to have different canonical names for different parts of the standard library.

    I can understand how things got to where they are; not sure I have any brilliant ideas about how to fix them.



    As an aside...

    You can look up the names for Sun's Java 6 implementation here.

    For UTF-8, the canonical values are "UTF-8" for java.nio and "UTF8" for java.lang and java.io. The only encodings the spec requires a JRE to support are: US-ASCII; ISO-8859-1; UTF-8; UTF-16BE; UTF-16LE; UTF-16.
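
    One way to see both naming schemes on a given JRE is to ask the runtime directly. The sketch below uses a hypothetical class name; it relies on Charset.name() for the java.nio canonical name and on InputStreamReader.getEncoding() for the java.io name. The exact java.io names printed may vary by implementation, so no output is shown for them.

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;
    import java.nio.charset.Charset;

    // Hypothetical inspection helper; names are illustrative only.
    final class CharsetNames {
        // The six charsets the specification requires every JRE to support.
        static final String[] REQUIRED = {
            "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
        };

        // Lookup is case-insensitive; name() returns the java.nio canonical name.
        static String nioName(String anyName) {
            return Charset.forName(anyName).name();
        }

        // getEncoding() reports the historical java.io name where one exists.
        // The reader wraps an empty in-memory stream and is left open on purpose,
        // since getEncoding() returns null once the reader has been closed.
        static String ioName(String anyName) throws UnsupportedEncodingException {
            return new InputStreamReader(
                    new ByteArrayInputStream(new byte[0]), anyName).getEncoding();
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            for (String name : REQUIRED) {
                System.out.println(name + ": supported=" + Charset.isSupported(name)
                        + ", java.nio name=" + nioName(name)
                        + ", java.io name=" + ioName(name));
            }
        }
    }
    ```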

  5. I long ago defined a utility class with UTF_8, ISO_8859_1 and US_ASCII Charset constants.

    Also, a long time ago (2+ years), I ran a simple performance test comparing new String( byte[], Charset ) with new String( byte[], String charset_name ) and found the latter to be CONSIDERABLY faster. If you look under the hood at the source code, you will see that they do indeed follow quite different paths.

    For that reason I included a utility in the same class

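    public static String stringFromByteArray (
    final byte[] array,
    final Charset charset
    )
    {
    try
    {
    // Decode via the charset's name to take the String-name code path.
    return new String( array, charset.name( ) );
    }
    catch ( UnsupportedEncodingException ex )
    {
    // Cannot happen: the name comes from an existing Charset instance.
    throw new AssertionError( ex );
    }
    }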


    Why the String( byte[], Charset ) constructor does not do the same beats me.
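
    A simple performance test of the kind described above might be sketched as follows. This is a naive System.nanoTime() comparison with a hypothetical class name, not the original test; it does little to control for JIT warm-up, and results are likely to differ between JDK versions, so a proper harness such as JMH would give more trustworthy numbers.

    ```java
    import java.io.UnsupportedEncodingException;
    import java.nio.charset.Charset;

    // Hypothetical timing sketch; not a rigorous benchmark.
    final class DecodeTiming {
        static final Charset UTF_8 = Charset.forName("UTF-8");

        // Nanoseconds spent decoding via the String(byte[], Charset) constructor.
        static long timeByCharset(byte[] data, int rounds) {
            long total = 0;
            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                total += new String(data, UTF_8).length();
            }
            long elapsed = System.nanoTime() - start;
            // Use the accumulated lengths so the loop has an observable result.
            if (total < 0) throw new IllegalStateException();
            return elapsed;
        }

        // Nanoseconds spent decoding via the String(byte[], String) constructor.
        static long timeByName(byte[] data, int rounds) throws UnsupportedEncodingException {
            long total = 0;
            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                total += new String(data, "UTF-8").length();
            }
            long elapsed = System.nanoTime() - start;
            if (total < 0) throw new IllegalStateException();
            return elapsed;
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            byte[] data = "The quick brown fox".getBytes(UTF_8);
            int rounds = 1000000;
            timeByCharset(data, rounds); // crude warm-up pass
            timeByName(data, rounds);
            System.out.println("Charset overload:     " + timeByCharset(data, rounds) + " ns");
            System.out.println("String-name overload: " + timeByName(data, rounds) + " ns");
        }
    }
    ```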

