Why does Java allow control characters in its identifiers?



The Mystery





In exploring precisely which characters were permitted in Java identifiers, I have stumbled upon something so extremely curious that it seems nearly certain to be a bug.





I’d expected to find that Java identifiers conformed to the requirement that they start with characters that have the Unicode property ID_Start and are followed by those with the property ID_Continue, with an exception granted for leading underscores and for dollar signs. That did not prove to be the case, and what I found is at extreme variance with that or any other idea of a normal identifier that I have heard of.





Short Demo





Consider the following demonstration proving that an ASCII ESC character (octal 033) is permitted in Java identifiers:







$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[







It’s even worse than that, though. Almost infinitely worse, in fact. Even NULLs are permitted! And thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.





Long Demo





Here is a test program that will show all these unexpected code points that Java quite outrageously allows as part of a legal identifier name.







#!/usr/bin/env perl
#
# test-java-idchars - find which bogus code points Java allows in its identifiers
#
# usage: test-java-idchars [low high]
# e.g.: test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000. You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) {
            $_ = oct if /^0/; # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) {
            $_ = oct if /^0/;
        }
    }
    else {
        die "usage: $0 [ [start] stop ]\n";
    }
}

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter" if /\pL/;
        $type = "Mark" if /\pM/;
        $type = "Number" if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol" if /\pS/;
        $type = "Separator" if /\pZ/;
        $type = "Control" if /\pC/;
    }
    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;
    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";
    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

    print TESTPROGRAM <<"End_of_Java_Program";

public class testchar {
    public static void main(String argv[]) {
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char);
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java \
          && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit; # from a ^C
    }

    printf "is %s in Java identifiers.\n",
        ($? == 0) ? uc "legal" : "forbidden";

}

END {
    print "Legal but evil code points: @legal\n";
}







Here is a sample run over just the first 33 code points (0 through 0x20); the script skips any that are whitespace or ordinary identifier characters:







$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B







And here is another demo:







$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD







The Question





Can anyone please explain this seemingly insane behavior? There are many, many, many other inexplicably permitted code points all over the place, starting right off with U+0000, which is perhaps the strangest of all. If you run it on the first 0x1000 code points, you do see certain patterns appear, such as permitting any and all code points with the property Currency_Symbol. But too much else is wholly inexplicable, at least by me.
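
A quicker cross-check, which avoids one compile per code point, is to put the same question to the Java runtime. The sketch below rests on an assumption: that the compiler accepts in an identifier exactly what Character.isJavaIdentifierPart() accepts, which is how JLS 3.8 is worded. The class name IdentifierPartScan and the helpers isOrdinaryWordChar and legalButOdd are made up for this illustration, and the "ordinary" filter is only an approximation of Perl's \w. Results also depend on the Unicode version your JDK ships.

```java
import java.util.ArrayList;
import java.util.List;

public class IdentifierPartScan {

    // Hypothetical helper: true for the categories one would ordinarily
    // expect inside an identifier (roughly what Perl's \w covers):
    // letters, marks, decimal digits, letter numbers, and connector
    // punctuation such as '_'.
    static boolean isOrdinaryWordChar(int cp) {
        if (Character.isLetter(cp)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.CONNECTOR_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Collect code points in [low, high] that are neither whitespace nor
    // ordinary word characters, yet are reported as identifier parts.
    static List<Integer> legalButOdd(int low, int high) {
        List<Integer> found = new ArrayList<>();
        for (int cp = low; cp <= high; cp++) {
            if (Character.isWhitespace(cp) || isOrdinaryWordChar(cp)) {
                continue;
            }
            if (Character.isJavaIdentifierPart(cp)) {
                found.add(cp);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Integer.decode accepts decimal, 0x-prefixed hex, and 0-prefixed octal.
        int low = args.length > 0 ? Integer.decode(args[0]) : 0;
        int high = args.length > 1 ? Integer.decode(args[1]) : 0x1000;
        for (int cp : legalButOdd(low, high)) {
            // getType() prints the numeric general-category constant.
            System.out.printf("U+%04X type=%d%n", cp, Character.getType(cp));
        }
    }
}
```

Run as `java IdentifierPartScan 0 0x20` or `java IdentifierPartScan 0x600 0x700` to cover the same ranges as the Perl runs above. Because this asks the library rather than the compiler, treat any disagreement with the Perl results as something to investigate, not as a verdict.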


Comments

  1. The Java Language Specification section 3.8 defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, has Character.isIdentifierIgnorable(), which allows non-whitespace control characters (including the whole C1 range; see the link for the list).

  2. Another question might be: Why shouldn't Java allow control characters in its identifiers?

    A good principle when designing a language or other system is to not forbid anything without good cause, since you never know how it might be used, and the fewer rules implementers and users have to contend with, the better.

    It is true that you certainly shouldn't take advantage of this, by actually embedding escapes into your variable names, and you won't see any popular libraries that expose classes with null characters in them.

    Certainly, this could be abused, but it isn't the language designer's job to protect programmers from themselves in this way, any more than by forcing proper indentation or well-chosen variable names.

  3. You can use Unicode escapes in the code to refer to the variables, i.e.

    int a\u0000 = 9;


    is valid Java code. That way you don't need the "evil" characters in the source code.

    (You can use Unicode escapes everywhere else, too, for example for the whitespace or even inside keywords... That can be confusing, as \u0022 will end a string, but I guess it is just that the Java designers decided to keep it consistent.)

  4. I don't see what the big deal is. How does it affect you in any way?

    If a developer wants to obfuscate his code, he can do it with ASCII.

    If a developer wants to make his code understandable, he will use the lingua franca of the industry: English. Not only will his identifiers be ASCII-only, they will also be built from common English words. Otherwise, nobody will use or read his code, and he can use whatever crazy characters he likes.

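
The explanation in the first comment and the Unicode-escape trick in the third can both be tried in a few lines. This is only a sketch: the class name IgnorableCheck and the helper explain are invented here, and the labels it returns are my own wording, not terms from the specification. It separates code points that get in through Character.isIdentifierIgnorable() from those that Character.isJavaIdentifierPart() accepts for some other reason, such as the dollar sign.

```java
public class IgnorableCheck {

    // Hypothetical helper: says by which route, if any, a code point is
    // accepted as part of a Java identifier.
    static String explain(int cp) {
        if (Character.isIdentifierIgnorable(cp)) {
            return "ignorable";
        }
        if (Character.isJavaIdentifierPart(cp)) {
            return "identifier part";
        }
        return "not allowed";
    }

    public static void main(String[] args) {
        // An identifier ending in ESC, written with a Unicode escape so
        // that the source file itself stays free of control characters.
        int var_\u001B = 9;
        System.out.println(var_\u001B);

        // NUL, ESC, FS, a C1 control, the dollar sign, and a plain letter.
        int[] samples = { 0x00, 0x1B, 0x1C, 0x85, '$', 'a' };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, explain(cp));
        }
    }
}
```

The split matters for the original question: the control and format characters fall under the "ignorable" route the first comment describes, while currency symbols are accepted on their own terms.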
