Introduction
We’re onto the third blog post in a four-part series about what I call the “big three” methods in Java. All of the big three methods are inherited from the Object
class, and they define behaviours that should generally be overridden in any class we implement. These methods are Object.toString()
, Object.equals(Object)
, and Object.hashCode()
.
The first blog post in this series covered the Object.toString()
method, and the second post introduced the Object.equals(Object)
method. I’ll add a link to the final post in this series, which will explore some additional subtleties of the equals(Object)
method in class hierarchies, once it’s published.
Today, we’ll explore the hashCode()
method: its purpose, how to write an implementation that’s “consistent” with an equals(Object)
method, and how to make that implementation effective and efficient.
Finding an Object in a List
To motivate why the Object.hashCode()
method exists at all, let’s return to the Cow
class we’ve been developing throughout this series of articles. A Cow
has two instance variables: its name
and its age
:
public class Cow
{
private String name;
private int age;
public Cow(String name, int age)
{
this.name = name;
this.age = age;
}
}
Let’s assume, now, that I have a collection of Cow
objects stored in an ArrayList
. We’ll fill this collection by having the user repeatedly input names and ages of cows. In the previous blog post, we wrote an inputCow()
method — it has the user input information about a single cow, and it returns a new Cow
object:
private static Scanner scan = new Scanner(System.in);
public static Cow inputCow()
{
System.out.print("Cow name: ");
String name = scan.nextLine();
System.out.print("Cow age: ");
int age = scan.nextInt();
scan.nextLine();
return new Cow(name, age);
}
We can call that method repeatedly to generate an ArrayList
of Cow
objects:
private static final int NUM_COWS = 3;
public static List<Cow> inputCowList()
{
List<Cow> ret = new ArrayList<>();
for (int i = 0; i < NUM_COWS; i++) {
ret.add(inputCow());
}
return ret;
}
Using these methods, let’s construct a small program. In this program, the user will be asked to input a list of cows with the inputCowList()
method. Then, the user will be asked to input information about a single cow, cowToFind
, with the inputCow()
method. The program will then search the List
of cows to see if cowToFind
exists inside the list:
public static void main(String[] args)
{
List<Cow> cowList = inputCowList();
System.out.println("Now, find a cow");
Cow cowToFind = inputCow();
if (cowList.contains(cowToFind)) {
System.out.println("Cow found");
}
else {
System.out.println("Cow not found");
}
}
Let’s try running this program, and see if we can find a five-year-old cow named Molly in our list. We input information about Molly as one of the three Cow
s to put into our List
, then input the same information a second time to check if she’s in our list. But, even though we know a Cow
object representing Molly is in our list, the program outputs “Cow not found
“:
$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
Cow not found
Importing Our Cow.equals(Object) Method
The reason we didn’t find Molly in our list of three cows is that we haven’t implemented a custom Cow.equals(Object)
method in our Cow
class. The List.contains(Object)
method uses of the argument’s equals(Object)
method, to test if that object is in the List
. That is to say, a Cow
object’s equals(Object)
method is used in the search.
But, without overriding the default implementation of Cow.equals(Object)
inherited from the Object
class, two Cow
objects are considered equal only if they’re the exact same object.
Reaching back into our previous blog post in this series, let’s copy the custom Cow.equals(Object)
method we wrote into our current program. This method considers two Cow
objects to be equal if their names and ages are the same:
public class Cow
{
[...]
@Override
public boolean equals(Object obj)
{
if (this == obj) {
return true;
}
if (!(obj instanceof Cow)) {
return false;
}
Cow other = (Cow)obj;
// Assumes this.name is non-null
return (this.name.equals(other.name) &&
this.age == other.age);
}
}
Now, when we run our program, our search finds Molly in our list of Cow
objects:
$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
Cow found
Splitting the List Apart
When we searched for Molly in our list of Cow
objects, we had to search through the entire List
to find her. Let’s explore one way to make searching for Molly more efficient — let’s break up the collection of Cow
objects into multiple lists.
As an example, we could store all the cows with odd ages in one list, and all the cows with even ages in another list. Then, when we want to search for Molly, age 5, we only have to search the list of odd-aged cows.
We could use even more lists if we wanted. Perhaps we could store all the cows aged 0-2 in one list, 3-5 in another list, and 6+ in a third list. Or we could split apart the cows by the first letter of their name. There are myriad possibilities for how we could split Cow
objects into different List
s.
As a starting idea, we could write an instance method in the Cow
class that takes the number of List
s as input, and returns the index of the List
which that Cow
object should be in. For example:
public class Cow
{
[...]
// Not a good design
public int whichList(int numLists)
{
return this.age % numLists;
}
}
While this idea could work, it unnecessarily couples the Cow
class with where Cow
objects are going to be stored. A better design would have the class responsible for storing Cow
objects decide where are how to store those objects.
Uncouple the Object and Its Storage
To address this coupling, let’s turn the problem around. Instead of telling the Cow
object how many lists there are and making it do the work of choosing one, let’s instead have the Cow
object return an integer. Then, the data structure where the Cow
is stored chooses how to convert that integer into a list index.
For example, instead of telling the Cow
object that there are two lists and making it return a list index of 0
or 1
based on its age, just have the Cow
object return its age:
public class Cow
{
[...]
// There's still a better way
public int whichList()
{
return this.age;
}
}
Then, it’ll be up to the storage data structure to determine how to convert that integer value, the cow’s age, into a list index. If the storage data structure has two lists, it might divide the Cow
objects into odd and even ages. Or, it might divide the Cow
objects into young cows and old cows. The key takeaway is that it shouldn’t be up to the Cow
object to decide which list it goes into — it should be up to the data structure where the Cow
will be stored. And, it can make that decision based on the integer that each Cow
object returns.
The Object.hashCode() Method
The more general term to describe splitting our Cow
objects into different lists is to say that we’re splitting our Cow
objects into different buckets. One data structure that can split Cow
objects into different buckets — based on an integer that each object returns — is a HashSet
.
In order to get an integer value from each Cow
object, which is used to choose a bucket for that Cow
, the HashSet
calls an instance method on each Cow
that’s similar to our whichList()
method. However, the name of the standard function that HashSet
calls isn’t Cow.whichList()
. The proper name of the method is Cow.hashCode()
.
The hashCode()
method is implemented in the Object
class, and we can override that method in Cow
. By overriding the method, we can specify which integer — called a hash code — each Cow
object will return for use in bucketing. We can still use the Cow
‘s age
as its hash code. Then, perhaps the HashSet
will split Cow
s into those with odd hash codes and those with even hash codes. Or, maybe it’ll split them up in some other manner.
Whatever way a HashSet
chooses to convert hash codes into bucket indices, there are two important points to note. First, if two Cow
objects return the same integer value from hashCode()
, they’re guaranteed to be placed in the same bucket. If, on the other hand, two Cow
objects return different integers from hashCode()
, they may or may not wind up in the same bucket — it’s depends how the HashSet
chooses to convert those different integer values into bucket indices.
Let’s change the name of our whichList()
method to hashCode()
, and add the @Override
annotation (because, as discussed in the first blog post in this series, we’re overriding a method from Object
):
public class Cow
{
[...]
@Override
public int hashCode()
{
return this.age;
}
}
Converting Our List to a HashSet
With the hashCode()
method overridden, we can now change the inputCowList()
method into an inputCowSet()
method. This method returns a HashSet
of Cow
objects, instead of an ArrayList
of Cow
objects:
private static final int NUM_COWS = 3;
public static Set<Cow> inputCowSet()
{
Set<Cow> ret = new HashSet<>();
for (int i = 0; i < NUM_COWS; i++) {
ret.add(inputCow());
}
return ret;
}
A Set
, like a List
, has a contains(Object)
method. So, our main program remains largely unchanged:
public static void main(String[] args)
{
Set<Cow> cowSet = inputCowSet();
System.out.println("Now, find a cow");
Cow cowToFind = inputCow();
if (cowSet.contains(cowToFind)) {
System.out.println("Cow found");
}
else {
System.out.println("Cow not found");
}
}
When we run this new program, we see that the output is unchanged — our cow Molly is found in the HashSet
:
$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
Cow found
We were able to find Molly in the HashSet
of Cow
objects because the hashCode()
method returned some kind of information — an integer — that partially identified the object. Knowing a cow’s hash code (for example, its age) isn’t enough to uniquely identify that cow, but it does serve as partial identification of the cow. The HashSet
class used that partially identifying information to choose which bucket to put Molly in.
When we asked whether the HashSet
contains a new Cow
object representing Molly, the HashSet
used that same partially identifying information to look in the correct bucket, and saw that Molly was there.
Default hashCode() Implementation
The previous example, in which we found our cow Molly in a HashSet
, relied on us overriding the default implementation of Cow.hashCode()
that was inherited from the Object
class. To understand why we have to override hashCode()
, let’s take a look at the default version of hashCode()
.
The return value of the Object.hashCode()
method, when it isn’t overridden in any subclass, is the identity hash code of the object. The identity hash code of an object — that is, its default hash code — is an int
value generated for each individual object, behind the scenes, by Java.
You can also find the identity hash code of an object using the System.identityHashCode(Object)
method. So, the following code, which uses the default Object.hashCode()
implementation, will output two identical values:
public static void main(String[] args)
{
Object o = new Object();
int h1 = o.hashCode();
int h2 = System.identityHashCode(o);
System.out.println(h1);
System.out.println(h2);
}
One thing to note is that the identity hash code for each individual object in a program is likely unique, but it isn’t guaranteed to be unique.
Keeping that in mind, let’s consider what our program looks like after we’ve input the information about Molly for a second time, and created a second Cow
object representing her:
$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
[...]
At this point in our program, we have two different Cow
objects representing Molly. One of them is in the HashSet
, and the other is passed to the HashSet
‘s contains(Object)
method. Those two Cow
objects likely, but not necessarily, have different identity hash codes.
So, if we don’t override the hashCode()
method in the Cow
class to return something other than the Cow
‘s identity hash code, the two Cow
objects representing Molly will likely, but not necessarily, have different hash codes. That means that the two Cow
objects representing Molly could, but not necessarily, belong in different buckets in the HashSet
.
So, would the first object representing Molly be found in the HashSet
when we run the Set
‘s contains(Object)
method? In the likely case that the two Molly objects go into different buckets, no. But, by small chance, the two Molly objects might belong in the same bucket — either because they have the same identity hash code, or because their two different identity hash codes get converted to the same bucket index by the HashSet
. In that case, the contains(Object)
call would find the original Cow
object representing Molly in the Set
.
The behaviour and output of our program, if we haven’t overridden the hashCode()
method in the Cow
class, will be indeterminate. Maybe Molly is found in the HashSet
, or maybe (likely) not.
Fundamentally, the cause of this indeterminate behaviour is that two objects that compare as equal — using the Cow.equals(Object)
method — might wind up in different buckets in a HashSet
. To solve this problem, we have to ensure that two equal objects always wind up in the same bucket.
Consistency with equals(Object)
To ensure that two equal objects always belong to the same bucket, two objects that compare as equal with the equals(Object)
method must return the same hash code from their hashCode()
methods. This requirement is laid out in the Java API for hashCode()
. It’s typically phrased as: the hashCode()
method of an object has to be consistent with its equals(Object)
method.
The default Object.hashCode()
method, which returns the identity hash code of an object, is consistent with the default implementation of Object.equals(Object)
. Recall that the default Object.equals(Object)
method returns true
only if the two Object
references point to the exact same object. Because no two distinct objects ever compare as equal, each individual object can use whatever value it likes as its hash code — for example, its identity hash code.
But once we override the equals(Object)
method in a class, two objects with different identity hash codes could compare as equal. Because those identity hash codes get returned by the default hashCode()
method as each object’s hash code, two objects that compare as equal could have different hash codes. This means that the default hashCode()
method is no longer consistent with equals(Object)
. In order to maintain hashCode()
‘s consistency with equals(Object)
, any time we override the equals(Object)
method, we also have to override the hashCode()
method.
A Consistent, if Useless, hashCode() Implementation
If our goal is just to make sure any two Cow
objects that compare as equal have the same hash code, we could use the same hash code for all Cow
objects. The following hashCode()
implementation is consistent with any implementation of equals(Object)
:
@Override
public int hashCode()
{
return 0;
}
However, this implementation of hashCode()
would defeat the purpose of using a HashSet
. If every Cow
object gets placed into the same bucket in a HashSet
, then searching through a HashSet
for a Cow
object would be no more efficient than searching for that object in a single List
that contains all the Cow
s. We can wind up searching through all the objects in a HashSet
in an attempt to find one of them, since we may have to look through the entire (single) bucket of Cow
objects.
Dispersing Objects with hashCode()
A better implementation of Cow.hashCode()
attempts to disperse Cow
objects into different buckets, while still ensuring that two Cow
objects that compare as equal wind up in the same bucket. We can accomplish this by having Cow.hashCode()
utilize some or all of the member variables used to compute Cow.equals(Object)
. Our earlier example, which uses the age of a Cow
as its hash code, is consistent with Cow.equals(Object)
:
public class Cow
{
[...]
@Override
public int hashCode()
{
return this.age;
}
}
Any two Cow
objects that compare as equal, in addition to having the same name
, must also have the same age
. So, any two equal Cow
objects will have the same hashCode()
return value. Assuming that not all cows are the same age, this implementation of hashCode()
should also result in Cow
objects being distributed across different buckets in a HashSet
. The better the distribution of Cow
objects among buckets, the more efficient our search for any single Cow
will be.
Note that two objects which don’t compare as equal can still have the same hashCode()
value. Specifically, two Cow
objects with the same age but different names won’t compare as equal, but they’ll use the same age as their hash code. That’s not a problem per se. It can affect the performance of HashSet
lookups in our program when non-equal objects have the same hash code, since they’ll wind up in the same bucket. But, it won’t affect correctness.
In short: it’s an error for two objects that compare as equal to have different hash codes. It’s just a performance issue, but not an error, for two objects that aren’t equal to have the same hash code.
As a general rule, this means that our implementation of hashCode()
should only use instance variables used in our implementation of equals(Object)
. The hashCode()
method should never use a field that could be different between two objects that nonetheless compare as equal. For example, consider if we were to add a new instance variable to the Cow
class called isChewingCud
. If we still want two Cow
objects to compare as equal if and only if their name
and age
fields are the same (with no consideration of isChewingCud
), using isChewingCud
as the return for hashCode()
would be incorrect.
Using Multiple Instance Variables
When two non-equal objects have the same hash code, this is called a hash collision. Hash collisions reduce how efficiently we can search for objects in a HashSet
, because they cause objects to get grouped together into fewer buckets. While objects with different hash codes might wind up in the same bucket, objects with the same hash code will be placed in the same bucket. So, we should try to minimize how frequently hash collisions occur.
To do so, we can use more of the information considered by the equals(Object)
method to build the return value for hashCode()
. By using additional instance variables to create a more complex hash code, we can make it more likely that non-equal objects will have different hash codes.
For example, we could use both a Cow
‘s name
and age
instance variables to compute its hash code. Naturally, the name
variable is a String
. So, to use it somehow in the computation of a hash code, we need to convert the String
value to an int
. To do that, we can actually ask name
for its hash code:
int nameHash = this.name.hashCode();
With name
represented as an int
, we could, for example, add it to age
to generate a hash code:
public class Cow
{
[...]
@Override
public int hashCode()
{
// Assumes this.name is non-null
int nameHash = this.name.hashCode();
return nameHash + this.age;
}
}
Because age
is already an int
, we didn’t have to do anything to convert it to a value that can be used in the hash code computation.
Note that our code assumed a Cow
‘s name is always non-null
. If it’s possible that name
could be null
, we can set nameHash
to 0
when it is. We could do this with an if
/else
statement; but, the way I find easier to read is to use the Objects.hashCode(Object)
method. It returns its argument’s hash code when the argument is non-null
, or 0
when the argument is null
:
@Override
public int hashCode()
{
int nameHash = Objects.hashCode(this.name);
return nameHash + this.age;
// Alternatively:
// int nameHash;
// if (this.name == null) {
// nameHash = 0;
// }
// else {
// nameHash = this.name.hashCode();
// }
// return nameHash + this.age;
}
Multiplying by 31
It turns out, though, that we can do even better than just adding the two values together. Instead, we can multiply one of the values by 31 before adding it to the other:
@Override
public int hashCode()
{
int nameHash = Objects.hashCode(this.name);
return 31 * nameHash + this.age;
}
When we multiply one of the values by 31 before adding it to the other, we increase the number of bits in the final hash code that are affected by our different instance variables, making it more likely that non-equal objects have different hash codes.
Using the number 31 as a multiplier isn’t as arbitrary as it might seem. The binary representation of 31 is all 1
s (11111
in binary). That means that, compared to any other 5-bit-long multiplier, 31 maximizes the number of bits that nameHash
contributes to the final sum.
Also, because 31 is 1 less than a power of 2 (it’s $2^5-1$), the multiplication can be computed very quickly with a bit shift followed by a subtraction. For an int
value x
:
31 * x == (x << 5) - x
This bit-shift operation isn’t anything we need to write in our Java code; it’s an optimization the Java compiler can run behind the scenes when it sees multiplication by 31.
Finally, using a prime number as the multiplier makes it less likely that systematic, numerical patterns will emerge in the hash codes of different objects. Numerical patterns can lead to objects with different hash codes frequently being placed in the same, small number of buckets in a HashSet
, negatively affecting performance.
It’s worth noting that there’s a built-in method, Objects.hash(Object...)
, that computes a value similar to what we computed above. That is, we could have written:
public class Cow
{
[...]
@Override
public int hashCode()
{
// Different return value from above
return Objects.hash(this.name, this.age);
}
}
The Objects.hash(Object...)
method doesn’t return exactly the same value as our manual computation, and it’s less efficient to run (because it constructs an array behind the scenes). But, it can be useful for prototyping a Cow.hashCode()
method that’s consistent with Cow.equals(Object)
.
Multiplication by 31 With More Than Two Instance Variables
Let’s generalize the idea of multiplying one instance variable’s hash code by 31 before adding it to the other. We can do something similar in the case where we have more than two instance variables to use in our hash code computation. Here, we can track an intermediate value in the computation, and multiply that intermediate value by 31 before adding in each new field’s hash code.
For example, let’s say our Cow
class were to have two additional instance variables, height
and weight
, both double
s. Let’s also assume that two Cow
objects compare as equal only when they have the same name
, age
, height
, and weight
.
Using repeated multiplication by 31 on an intermediate value looks like the following:
@Override
public int hashCode()
{
int ret = Objects.hashCode(this.name);
ret = 31 * ret + this.age;
ret = 31 * ret + Objects.hashCode(this.height);
ret = 31 * ret + Objects.hashCode(this.weight);
return ret;
}
This approach is equivalent to multiplying each instance variable’s hash code by a different power of 31 ($31^0 = 1$, $31^1 = 31$, $31^2 = 961$, and $31^3 = 29791$) before adding it to a running sum. Multiplying each hash code by a different power of 31 increases the number of bits that each instance variable affects in the final result.
Be careful when repeatedly multiplying intermediate values by 31, though, because there’s a common mistake I see students make. Be sure that it’s the intermediate computation you multiply by 31 before adding in each instance variable’s hash code, as opposed to just multiplying each instance variable’s hash code by 31. As an example of this mistake, here age
, and the hash codes of height
and weight
, are all multiplied by the same value of 31:
@Override
public int hashCode()
{
// COMMON MISTAKE
int ret = Objects.hashCode(this.name);
ret = ret + 31 * this.age
ret = ret + 31 * Objects.hashCode(this.height);
ret = ret + 31 * Objects.hashCode(this.weight);
return ret;
}
Because the hash codes of each instance variable are all multiplied by the same value, 31, the multiplication doesn’t distribute the bits from each of the instance variables across different bits in the final hash code. It’s essentially the same as if we added together, with no multiplication, age
and the hash codes of height
and weight
. So, be sure instead that it’s the intermediate value, ret
, that we multiply by 31.
Final Versions of hashCode() and equals(Object)
Let’s put together the Cow.equals(Object)
and Cow.hashCode()
methods, in the version of the Cow
class with two instance variables, name
and age
. Our version of Cow.equals(Object)
compares two cows as equal when their names and ages are the same. Our final version of Cow.hashCode()
method is consistent with Cow.equals(Object)
, and uses multiplication by 31 to effectively distribute Cow
hash codes across the buckets in a HashSet
:
public class Cow
{
private String name;
private int age;
[...]
@Override
public boolean equals(Object obj)
{
if (this == obj) {
return true;
}
if (!(obj instanceof Cow)) {
return false;
}
Cow other = (Cow)obj;
// Assumes this.name is non-null
return (this.name.equals(other.name) &&
this.age == other.age);
}
@Override
public int hashCode()
{
int ret = Objects.hashCode(this.name);
ret = 31 * ret + this.age;
return ret;
}
}
Exercise
The identity hash code of an object — returned by the default implementation of Object.hashCode()
, and by the System.identityHashCode(Object)
method — doesn’t necessarily return a value that’s unique for every object. Two different objects may, though it’s not likely, have the same identity hash code.
For this exercise, demonstrate that two different objects can have the same identity hash code. Create a HashSet
of Integer
values to store the identity hash codes of objects, and repeatedly create new objects until you see a duplicate value. How many objects did you have to create before you saw a duplicate identity hash code?
Conclusion
When we create our own classes in Java, we should generally override the “big three” methods: Object.toString()
, Object.equals(Object)
, and Object.hashCode()
. In today’s blog post, we explored how to override the Object.hashCode()
method and why we do it.
The hashCode()
method is used to split objects into buckets in a data structure like a HashSet
. When we search for an object in a HashSet
, we have to ensure that we’re searching in the correct bucket. Because of that, the value returned by the hashCode()
method has to be consistent with the return value of the object’s equals(Object)
method. Two objects that compare as equal must have the same hash code.
Practically speaking, this consistency requirement means that anytime we override the equals(Object)
method in a class, we also need to override the hashCode()
method.
A general approach for writing an effective hashCode()
method that’s consistent with equals(Object)
is to add the hash code of each instance variable used in equals(Object)
to a running intermediate value. But, before adding each individual hash code into that running value, multiply the intermediate value — not the instance variable’s individual hash code — by 31.
For more tips, and to arrange for personalized tutoring for yourself or your study group, check out Vancouver Computer Science Tutoring.