Skip to main content

What is a sensible prime for hashcode calculation?


Eclipse 3.5 has a very nice feature to generate Java hashCode() functions. It would generate for example (slightly shortened:)




class HashTest {
int i;
int j;
public int hashCode() {
final int prime = 31;
int result = prime + i;
result = prime * result + j;
return result;
}
}



(If you have more attributes in the class, result = prime * result + attribute.hashCode(); is repeated for each additional attribute. For ints .hashCode() can be omitted.)



This seems fine but for the choice 31 for the prime. It is probably taken from the hashCode implementation of Java String , which was used for performance reasons that are long gone after the introduction of hardware multipliers. Here you have many hashcode collisions for small values of i and j: for example (0,0) and (-1,31) have the same value. I think that is a Bad Thing(TM), since small values occur often. For String.hashCode you'll also find many short strings with the same hashcode, for instance "Ca" and "DB". If you take a large prime, this problem disappears if you choose the prime right.



So my question: what is a good prime to choose? What criteria do you apply to find it?



This is meant as a general question - so I do not want to give a range for i and j. But I suppose in most applications relatively small values occur more often than large values. (If you have large values the choice of the prime is probably unimportant.) It might not make much of a difference, but a better choice is an easy and obvious way to improve this - so why not do it? Commons lang HashCodeBuilder also suggests curiously small values.


Source: Tips4allCCNA FINAL EXAM

Comments

  1. To give a meaningful answer to this you have to know something about the possible values of i and j. The only thing I can think of in general is, that in many cases small values will be more common than large values. (The odds of 15 appearing as a value in your program are much better than, say, 438281923.) So it seems a good idea to make the smallest hashcode collision as large as possible by choosing an appropriate prime. For 31 this rather bad - already for i=-1 and j=31 you have the same hash value as for i=0 and j=0.

    Since this is interesting, I've written a little program that searched the whole int range for the best prime in this sense. That is, for each prime I searched for the minimum value of Math.abs(i)+Math.abs(j) over all values of i,j that have the same hashcode as 0,0, and then took the prime where this minimum value is as large as possible.

    Drumroll: the best prime in this sense is 486187739 (with the smallest collision being i=-25486, j=67194). Nearly as good and much easier to remember is 92821 with the smallest collision being i=-46272 and j=46016.

    If you give "small" another meaning and want to be the minimum of Math.sqrt(i*i+j*j) for the collision as large as possible, the results are a little different: the best would be 1322837333 with i=-6815 and j=70091, but my favourite 92821 (smallest collision -46272,46016) is again almost as good as the best value.

    I do acknowledge that it is quite debatable whether these calculation make much sense in practice. But I do think that taking 92821 as prime makes much more sense than 31, unless you have good reasons not to.

    ReplyDelete
  2. Collisions may not be such a big issue... The primary goal of the hash is to avoid using equals for 1:1 comparisons.
    If you have an implementation where equals is "generally" extremely cheap for objects that have collided hashs, then this is not an issue (at all).

    In the end, what is the best way of hashing depends on what you are comparing. In the case of an int pair (as in your example), using basic bitwise operators could be sufficient (as using & or ^).

    ReplyDelete
  3. Actually, if you take a prime so large that it comes close to INT_MAX, you have the same problem because of modulo arithmetic. If you expect to hash mostly strings of length 2, perhaps a prime near the square root of INT_MAX would be best, if the strings you hash are longer it doesn't matter so much and collisions are unavoidable anyway...

    ReplyDelete
  4. I'd choose 7243. Large enough to avoid collissions with small numbers. Doesn't overflow to small numbers quickly.

    ReplyDelete

Post a Comment

Popular posts from this blog

[韓日関係] 首相含む大幅な内閣改造の可能性…早ければ来月10日ごろ=韓国

div not scrolling properly with slimScroll plugin

I am using the slimScroll plugin for jQuery by Piotr Rochala Which is a great plugin for nice scrollbars on most browsers but I am stuck because I am using it for a chat box and whenever the user appends new text to the boxit does scroll using the .scrollTop() method however the plugin's scrollbar doesnt scroll with it and when the user wants to look though the chat history it will start scrolling from near the top. I have made a quick demo of my situation http://jsfiddle.net/DY9CT/2/ Does anyone know how to solve this problem?

Why does this javascript based printing cause Safari to refresh the page?

The page I am working on has a javascript function executed to print parts of the page. For some reason, printing in Safari, causes the window to somehow update. I say somehow, because it does not really refresh as in reload the page, but rather it starts the "rendering" of the page from start, i.e. scroll to top, flash animations start from 0, and so forth. The effect is reproduced by this fiddle: http://jsfiddle.net/fYmnB/ Clicking the print button and finishing or cancelling a print in Safari causes the screen to "go white" for a sec, which in my real website manifests itself as something "like" a reload. While running print button with, let's say, Firefox, just opens and closes the print dialogue without affecting the fiddle page in any way. Is there something with my way of calling the browsers print method that causes this, or how can it be explained - and preferably, avoided? P.S.: On my real site the same occurs with Chrome. In the ex