The Big Three in Java, Part Three: hashCode()

Introduction

We’re onto the third blog post in a four-part series about what I call the “big three” methods in Java. All of the big three methods are inherited from the Object class, and they define behaviours that should generally be overridden in any class we implement. These methods are Object.toString(), Object.equals(Object), and Object.hashCode().

The first blog post in this series covered the Object.toString() method, and the second post introduced the Object.equals(Object) method. I’ll add a link to the final post in this series, which will explore some additional subtleties of the equals(Object) method in class hierarchies, once it’s published.

Today, we’ll explore the hashCode() method: its purpose, how to write an implementation that’s “consistent” with an equals(Object) method, and how to make that implementation effective and efficient.

Finding an Object in a List

To motivate why the Object.hashCode() method exists at all, let’s return to the Cow class we’ve been developing throughout this series of articles. A Cow has two instance variables: its name and its age:

public class Cow
{
    private String name;
    private int age;

    public Cow(String name, int age)
    {
        this.name = name;
        this.age = age;
    }
}

Let’s assume, now, that I have a collection of Cow objects stored in an ArrayList. We’ll fill this collection by having the user repeatedly input names and ages of cows. In the previous blog post, we wrote an inputCow() method — it has the user input information about a single cow, and it returns a new Cow object:

private static Scanner scan = new Scanner(System.in);

public static Cow inputCow()
{
    System.out.print("Cow name: ");
    String name = scan.nextLine();
    System.out.print("Cow age: ");
    int age = scan.nextInt();
    scan.nextLine();
    return new Cow(name, age);
}

We can call that method repeatedly to generate an ArrayList of Cow objects:

private static final int NUM_COWS = 3;

public static List<Cow> inputCowList()
{
    List<Cow> ret = new ArrayList<>();
    for (int i = 0; i < NUM_COWS; i++) {
        ret.add(inputCow());
    }
    return ret;
}

Using these methods, let’s construct a small program. In this program, the user will be asked to input a list of cows with the inputCowList() method. Then, the user will be asked to input information about a single cow, cowToFind, with the inputCow() method. The program will then search the List of cows to see if cowToFind exists inside the list:

public static void main(String[] args)
{
    List<Cow> cowList = inputCowList();
    System.out.println("Now, find a cow");
    Cow cowToFind = inputCow();
    if (cowList.contains(cowToFind)) {
        System.out.println("Cow found");
    }
    else {
        System.out.println("Cow not found");
    }
}

Let’s try running this program, and see if we can find a five-year-old cow named Molly in our list. We input information about Molly as one of the three Cows to put into our List, then input the same information a second time to check if she’s in our list. But, even though we know a Cow object representing Molly is in our list, the program outputs “Cow not found“:

$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
Cow not found

Importing Our Cow.equals(Object) Method

The reason we didn’t find Molly in our list of three cows is that we haven’t implemented a custom Cow.equals(Object) method in our Cow class. The List.contains(Object) method uses of the argument’s equals(Object) method, to test if that object is in the List. That is to say, a Cow object’s equals(Object) method is used in the search.

But, without overriding the default implementation of Cow.equals(Object) inherited from the Object class, two Cow objects are considered equal only if they’re the exact same object.

Reaching back into our previous blog post in this series, let’s copy the custom Cow.equals(Object) method we wrote into our current program. This method considers two Cow objects to be equal if their names and ages are the same:

public class Cow
{
    [...]

    @Override
    public boolean equals(Object obj)
    {
        if (this == obj) {
            return true;
        }

        if (!(obj instanceof Cow)) {
            return false;
        }

        Cow other = (Cow)obj;

        // Assumes this.name is non-null
        return (this.name.equals(other.name) &&
            this.age == other.age);
    }
}

Now, when we run our program, our search finds Molly in our list of Cow objects:

$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
Cow found

Splitting the List Apart

When we searched for Molly in our list of Cow objects, we had to search through the entire List to find her. Let’s explore one way to make searching for Molly more efficient — let’s break up the collection of Cow objects into multiple lists.

As an example, we could store all the cows with odd ages in one list, and all the cows with even ages in another list. Then, when we want to search for Molly, age 5, we only have to search the list of odd-aged cows.

We could use even more lists if we wanted. Perhaps we could store all the cows aged 0-2 in one list, 3-5 in another list, and 6+ in a third list. Or we could split apart the cows by the first letter of their name. There are myriad possibilities for how we could split Cow objects into different Lists.

As a starting idea, we could write an instance method in the Cow class that takes the number of Lists as input, and returns the index of the List which that Cow object should be in. For example:

public class Cow
{
    [...]

    // Not a good design
    public int whichList(int numLists)
    {
        return this.age % numLists;
    }
}

While this idea could work, it unnecessarily couples the Cow class with where Cow objects are going to be stored. A better design would have the class responsible for storing Cow objects decide where are how to store those objects.

Uncouple the Object and Its Storage

To address this coupling, let’s turn the problem around. Instead of telling the Cow object how many lists there are and making it do the work of choosing one, let’s instead have the Cow object return an integer. Then, the data structure where the Cow is stored chooses how to convert that integer into a list index.

For example, instead of telling the Cow object that there are two lists and making it return a list index of 0 or 1 based on its age, just have the Cow object return its age:

public class Cow
{
    [...]

    // There's still a better way
    public int whichList()
    {
        return this.age;
    }
}

Then, it’ll be up to the storage data structure to determine how to convert that integer value, the cow’s age, into a list index. If the storage data structure has two lists, it might divide the Cow objects into odd and even ages. Or, it might divide the Cow objects into young cows and old cows. The key takeaway is that it shouldn’t be up to the Cow object to decide which list it goes into — it should be up to the data structure where the Cow will be stored. And, it can make that decision based on the integer that each Cow object returns.

The Object.hashCode() Method

The more general term to describe splitting our Cow objects into different lists is to say that we’re splitting our Cow objects into different buckets. One data structure that can split Cow objects into different buckets — based on an integer that each object returns — is a HashSet.

In order to get an integer value from each Cow object, which is used to choose a bucket for that Cow, the HashSet calls an instance method on each Cow that’s similar to our whichList() method. However, the name of the standard function that HashSet calls isn’t Cow.whichList(). The proper name of the method is Cow.hashCode().

The hashCode() method is implemented in the Object class, and we can override that method in Cow. By overriding the method, we can specify which integer — called a hash code — each Cow object will return for use in bucketing. We can still use the Cow‘s age as its hash code. Then, perhaps the HashSet will split Cows into those with odd hash codes and those with even hash codes. Or, maybe it’ll split them up in some other manner.

Whatever way a HashSet chooses to convert hash codes into bucket indices, there are two important points to note. First, if two Cow objects return the same integer value from hashCode(), they’re guaranteed to be placed in the same bucket. If, on the other hand, two Cow objects return different integers from hashCode(), they may or may not wind up in the same bucket — it’s depends how the HashSet chooses to convert those different integer values into bucket indices.

Let’s change the name of our whichList() method to hashCode(), and add the @Override annotation (because, as discussed in the first blog post in this series, we’re overriding a method from Object):

public class Cow
{
    [...]

    @Override
    public int hashCode()
    {
        return this.age;
    }
}

Converting Our List to a HashSet

With the hashCode() method overridden, we can now change the inputCowList() method into an inputCowSet() method. This method returns a HashSet of Cow objects, instead of an ArrayList of Cow objects:

private static final int NUM_COWS = 3;

public static Set<Cow> inputCowSet()
{
    Set<Cow> ret = new HashSet<>();
    for (int i = 0; i < NUM_COWS; i++) {
        ret.add(inputCow());
    }
    return ret;
}

A Set, like a List, has a contains(Object) method. So, our main program remains largely unchanged:

public static void main(String[] args)
{
    Set<Cow> cowSet = inputCowSet();
    System.out.println("Now, find a cow");
    Cow cowToFind = inputCow();
    if (cowSet.contains(cowToFind)) {
        System.out.println("Cow found");
    }
    else {
        System.out.println("Cow not found");
    }
}

When we run this new program, we see that the output is unchanged — our cow Molly is found in the HashSet:

$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
Cow found

We were able to find Molly in the HashSet of Cow objects because the hashCode() method returned some kind of information — an integer — that partially identified the object. Knowing a cow’s hash code (for example, its age) isn’t enough to uniquely identify that cow, but it does serve as partial identification of the cow. The HashSet class used that partially identifying information to choose which bucket to put Molly in.

When we asked whether the HashSet contains a new Cow object representing Molly, the HashSet used that same partially identifying information to look in the correct bucket, and saw that Molly was there.

Default hashCode() Implementation

The previous example, in which we found our cow Molly in a HashSet, relied on us overriding the default implementation of Cow.hashCode() that was inherited from the Object class. To understand why we have to override hashCode(), let’s take a look at the default version of hashCode().

The return value of the Object.hashCode() method, when it isn’t overridden in any subclass, is the identity hash code of the object. The identity hash code of an object — that is, its default hash code — is an int value generated for each individual object, behind the scenes, by Java.

You can also find the identity hash code of an object using the System.identityHashCode(Object) method. So, the following code, which uses the default Object.hashCode() implementation, will output two identical values:

public static void main(String[] args)
{
    Object o = new Object();
    int h1 = o.hashCode();
    int h2 = System.identityHashCode(o);
    System.out.println(h1);
    System.out.println(h2);
}

One thing to note is that the identity hash code for each individual object in a program is likely unique, but it isn’t guaranteed to be unique.

Keeping that in mind, let’s consider what our program looks like after we’ve input the information about Molly for a second time, and created a second Cow object representing her:

$ java Cow
Cow name: Bessy
Cow age: 3
Cow name: Buttercup
Cow age: 4
Cow name: Molly
Cow age: 5
Now, find a cow
Cow name: Molly
Cow age: 5
[...]

At this point in our program, we have two different Cow objects representing Molly. One of them is in the HashSet, and the other is passed to the HashSet‘s contains(Object) method. Those two Cow objects likely, but not necessarily, have different identity hash codes.

So, if we don’t override the hashCode() method in the Cow class to return something other than the Cow‘s identity hash code, the two Cow objects representing Molly will likely, but not necessarily, have different hash codes. That means that the two Cow objects representing Molly could, but not necessarily, belong in different buckets in the HashSet.

So, would the first object representing Molly be found in the HashSet when we run the Set‘s contains(Object) method? In the likely case that the two Molly objects go into different buckets, no. But, by small chance, the two Molly objects might belong in the same bucket — either because they have the same identity hash code, or because their two different identity hash codes get converted to the same bucket index by the HashSet. In that case, the contains(Object) call would find the original Cow object representing Molly in the Set.

The behaviour and output of our program, if we haven’t overridden the hashCode() method in the Cow class, will be indeterminate. Maybe Molly is found in the HashSet, or maybe (likely) not.

Fundamentally, the cause of this indeterminate behaviour is that two objects that compare as equal — using the Cow.equals(Object) method — might wind up in different buckets in a HashSet. To solve this problem, we have to ensure that two equal objects always wind up in the same bucket.

Consistency with equals(Object)

To ensure that two equal objects always belong to the same bucket, two objects that compare as equal with the equals(Object) method must return the same hash code from their hashCode() methods. This requirement is laid out in the Java API for hashCode(). It’s typically phrased as: the hashCode() method of an object has to be consistent with its equals(Object) method.

The default Object.hashCode() method, which returns the identity hash code of an object, is consistent with the default implementation of Object.equals(Object). Recall that the default Object.equals(Object) method returns true only if the two Object references point to the exact same object. Because no two distinct objects ever compare as equal, each individual object can use whatever value it likes as its hash code — for example, its identity hash code.

But once we override the equals(Object) method in a class, two objects with different identity hash codes could compare as equal. Because those identity hash codes get returned by the default hashCode() method as each object’s hash code, two objects that compare as equal could have different hash codes. This means that the default hashCode() method is no longer consistent with equals(Object). In order to maintain hashCode()‘s consistency with equals(Object), any time we override the equals(Object) method, we also have to override the hashCode() method.

A Consistent, if Useless, hashCode() Implementation

If our goal is just to make sure any two Cow objects that compare as equal have the same hash code, we could use the same hash code for all Cow objects. The following hashCode() implementation is consistent with any implementation of equals(Object):

@Override
public int hashCode()
{
    return 0;
}

However, this implementation of hashCode() would defeat the purpose of using a HashSet. If every Cow object gets placed into the same bucket in a HashSet, then searching through a HashSet for a Cow object would be no more efficient than searching for that object in a single List that contains all the Cows. We can wind up searching through all the objects in a HashSet in an attempt to find one of them, since we may have to look through the entire (single) bucket of Cow objects.

Dispersing Objects with hashCode()

A better implementation of Cow.hashCode() attempts to disperse Cow objects into different buckets, while still ensuring that two Cow objects that compare as equal wind up in the same bucket. We can accomplish this by having Cow.hashCode() utilize some or all of the member variables used to compute Cow.equals(Object). Our earlier example, which uses the age of a Cow as its hash code, is consistent with Cow.equals(Object):

public class Cow
{
    [...]

    @Override
    public int hashCode()
    {
        return this.age;
    }
}

Any two Cow objects that compare as equal, in addition to having the same name, must also have the same age. So, any two equal Cow objects will have the same hashCode() return value. Assuming that not all cows are the same age, this implementation of hashCode() should also result in Cow objects being distributed across different buckets in a HashSet. The better the distribution of Cow objects among buckets, the more efficient our search for any single Cow will be.

Note that two objects which don’t compare as equal can still have the same hashCode() value. Specifically, two Cow objects with the same age but different names won’t compare as equal, but they’ll use the same age as their hash code. That’s not a problem per se. It can affect the performance of HashSet lookups in our program when non-equal objects have the same hash code, since they’ll wind up in the same bucket. But, it won’t affect correctness.

In short: it’s an error for two objects that compare as equal to have different hash codes. It’s just a performance issue, but not an error, for two objects that aren’t equal to have the same hash code.

As a general rule, this means that our implementation of hashCode() should only use instance variables used in our implementation of equals(Object). The hashCode() method should never use a field that could be different between two objects that nonetheless compare as equal. For example, consider if we were to add a new instance variable to the Cow class called isChewingCud. If we still want two Cow objects to compare as equal if and only if their name and age fields are the same (with no consideration of isChewingCud), using isChewingCud as the return for hashCode() would be incorrect.

Using Multiple Instance Variables

When two non-equal objects have the same hash code, this is called a hash collision. Hash collisions reduce how efficiently we can search for objects in a HashSet, because they cause objects to get grouped together into fewer buckets. While objects with different hash codes might wind up in the same bucket, objects with the same hash code will be placed in the same bucket. So, we should try to minimize how frequently hash collisions occur.

To do so, we can use more of the information considered by the equals(Object) method to build the return value for hashCode(). By using additional instance variables to create a more complex hash code, we can make it more likely that non-equal objects will have different hash codes.

For example, we could use both a Cow‘s name and age instance variables to compute its hash code. Naturally, the name variable is a String. So, to use it somehow in the computation of a hash code, we need to convert the String value to an int. To do that, we can actually ask name for its hash code:

int nameHash = this.name.hashCode();

With name represented as an int, we could, for example, add it to age to generate a hash code:

public class Cow
{
    [...]

    @Override
    public int hashCode()
    {
        // Assumes this.name is non-null
        int nameHash = this.name.hashCode();
        return nameHash + this.age;
    }
}

Because age is already an int, we didn’t have to do anything to convert it to a value that can be used in the hash code computation.

Note that our code assumed a Cow‘s name is always non-null. If it’s possible that name could be null, we can set nameHash to 0 when it is. We could do this with an if/else statement; but, the way I find easier to read is to use the Objects.hashCode(Object) method. It returns its argument’s hash code when the argument is non-null, or 0 when the argument is null:

@Override
public int hashCode()
{
    int nameHash = Objects.hashCode(this.name);
    return nameHash + this.age;

    // Alternatively:
    // int nameHash;
    // if (this.name == null) {
    //     nameHash = 0;
    // }
    // else {
    //     nameHash = this.name.hashCode();
    // }
    // return nameHash + this.age;
}

Multiplying by 31

It turns out, though, that we can do even better than just adding the two values together. Instead, we can multiply one of the values by 31 before adding it to the other:

@Override
public int hashCode()
{
    int nameHash = Objects.hashCode(this.name);
    return 31 * nameHash + this.age;
}

When we multiply one of the values by 31 before adding it to the other, we increase the number of bits in the final hash code that are affected by our different instance variables, making it more likely that non-equal objects have different hash codes.

Using the number 31 as a multiplier isn’t as arbitrary as it might seem. The binary representation of 31 is all 1s (11111 in binary). That means that, compared to any other 5-bit-long multiplier, 31 maximizes the number of bits that nameHash contributes to the final sum.

Also, because 31 is 1 less than a power of 2 (it’s $2^5-1$), the multiplication can be computed very quickly with a bit shift followed by a subtraction. For an int value x:

31 * x == (x << 5) - x

This bit-shift operation isn’t anything we need to write in our Java code; it’s an optimization the Java compiler can run behind the scenes when it sees multiplication by 31.

Finally, using a prime number as the multiplier makes it less likely that systematic, numerical patterns will emerge in the hash codes of different objects. Numerical patterns can lead to objects with different hash codes frequently being placed in the same, small number of buckets in a HashSet, negatively affecting performance.

It’s worth noting that there’s a built-in method, Objects.hash(Object...), that computes a value similar to what we computed above. That is, we could have written:

public class Cow
{
    [...]

    @Override
    public int hashCode()
    {
        // Different return value from above
        return Objects.hash(this.name, this.age);
    }
}

The Objects.hash(Object...) method doesn’t return exactly the same value as our manual computation, and it’s less efficient to run (because it constructs an array behind the scenes). But, it can be useful for prototyping a Cow.hashCode() method that’s consistent with Cow.equals(Object).

Multiplication by 31 With More Than Two Instance Variables

Let’s generalize the idea of multiplying one instance variable’s hash code by 31 before adding it to the other. We can do something similar in the case where we have more than two instance variables to use in our hash code computation. Here, we can track an intermediate value in the computation, and multiply that intermediate value by 31 before adding in each new field’s hash code.

For example, let’s say our Cow class were to have two additional instance variables, height and weight, both doubles. Let’s also assume that two Cow objects compare as equal only when they have the same name, age, height, and weight.

Using repeated multiplication by 31 on an intermediate value looks like the following:

@Override
public int hashCode()
{
    int ret = Objects.hashCode(this.name);
    ret = 31 * ret + this.age;
    ret = 31 * ret + Objects.hashCode(this.height);
    ret = 31 * ret + Objects.hashCode(this.weight);
    return ret;
}

This approach is equivalent to multiplying each instance variable’s hash code by a different power of 31 ($31^0 = 1$, $31^1 = 31$, $31^2 = 961$, and $31^3 = 29791$) before adding it to a running sum. Multiplying each hash code by a different power of 31 increases the number of bits that each instance variable affects in the final result.

Be careful when repeatedly multiplying intermediate values by 31, though, because there’s a common mistake I see students make. Be sure that it’s the intermediate computation you multiply by 31 before adding in each instance variable’s hash code, as opposed to just multiplying each instance variable’s hash code by 31. As an example of this mistake, here age, and the hash codes of height and weight, are all multiplied by the same value of 31:

@Override
public int hashCode()
{
    // COMMON MISTAKE
    int ret = Objects.hashCode(this.name);
    ret = ret + 31 * this.age
    ret = ret + 31 * Objects.hashCode(this.height);
    ret = ret + 31 * Objects.hashCode(this.weight);
    return ret;
}

Because the hash codes of each instance variable are all multiplied by the same value, 31, the multiplication doesn’t distribute the bits from each of the instance variables across different bits in the final hash code. It’s essentially the same as if we added together, with no multiplication, age and the hash codes of height and weight. So, be sure instead that it’s the intermediate value, ret, that we multiply by 31.

Final Versions of hashCode() and equals(Object)

Let’s put together the Cow.equals(Object) and Cow.hashCode() methods, in the version of the Cow class with two instance variables, name and age. Our version of Cow.equals(Object) compares two cows as equal when their names and ages are the same. Our final version of Cow.hashCode() method is consistent with Cow.equals(Object), and uses multiplication by 31 to effectively distribute Cow hash codes across the buckets in a HashSet:

public class Cow
{
    private String name;
    private int age;

    [...]

    @Override
    public boolean equals(Object obj)
    {
        if (this == obj) {
            return true;
        }

        if (!(obj instanceof Cow)) {
            return false;
        }

        Cow other = (Cow)obj;

        // Assumes this.name is non-null
        return (this.name.equals(other.name) &&
            this.age == other.age);
    }

    @Override
    public int hashCode()
    {
        int ret = Objects.hashCode(this.name);
        ret = 31 * ret + this.age;
        return ret;
    }
}

Exercise

The identity hash code of an object — returned by the default implementation of Object.hashCode(), and by the System.identityHashCode(Object) method — doesn’t necessarily return a value that’s unique for every object. Two different objects may, though it’s not likely, have the same identity hash code.

For this exercise, demonstrate that two different objects can have the same identity hash code. Create a HashSet of Integer values to store the identity hash codes of objects, and repeatedly create new objects until you see a duplicate value. How many objects did you have to create before you saw a duplicate identity hash code?

Conclusion

When we create our own classes in Java, we should generally override the “big three” methods: Object.toString(), Object.equals(Object), and Object.hashCode(). In today’s blog post, we explored how to override the Object.hashCode() method and why we do it.

The hashCode() method is used to split objects into buckets in a data structure like a HashSet. When we search for an object in a HashSet, we have to ensure that we’re searching in the correct bucket. Because of that, the value returned by the hashCode() method has to be consistent with the return value of the object’s equals(Object) method. Two objects that compare as equal must have the same hash code.

Practically speaking, this consistency requirement means that anytime we override the equals(Object) method in a class, we also need to override the hashCode() method.

A general approach for writing an effective hashCode() method that’s consistent with equals(Object) is to add the hash code of each instance variable used in equals(Object) to a running intermediate value. But, before adding each individual hash code into that running value, multiply the intermediate value — not the instance variable’s individual hash code — by 31.

For more tips, and to arrange for personalized tutoring for yourself or your study group, check out Vancouver Computer Science Tutoring.