Friday, August 6, 2010

Public Constructors Considered Harmful: Or, the Amazing Misadventures of HashSet and HashMap

A few months ago I thought I'd try my hand at an implementation of hash array mapped tries. This is the data structure used by Scala (and Clojure) for immutable hash sets and maps. The data structure provides copy-on-write semantics with reasonably fast update and, after each update, a high degree of sharing between the old and new versions of the structure.
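To see what that copy-on-write behavior means in practice, here is a minimal example using the standard library's immutable HashSet (my own implementation is meant to behave the same way):

import scala.collection.immutable.HashSet

object SharingDemo {
  def main(args: Array[String]): Unit = {
    val s1 = HashSet(1, 2, 3)
    val s2 = s1 + 4              // a new set; s1 is left untouched
    println(s1.contains(4))      // false
    println(s2.contains(4))      // true
    // Under the hood, s1 and s2 share most of their trie nodes,
    // so the "copy" is cheap in both time and space.
  }
}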

My purposes in this adventure were twofold:
  • To see whether I could come up with variations that improve on the performance of the standard Scala implementations.
  • To see whether I could browbeat Scala's type system into letting me use a single implementation for both sets and maps.

I managed to achieve the latter goal, although I went down a number of blind alleys before I got there. This was my first project to make much use of higher-kinded types, and it took me a while to wrap my mind around the idea that I was, in effect, writing compile-time functions that take types and produce new types. That's a powerful idea, but it's not what I want to get to in this post.
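Just to give the flavor of what I mean, here is a tiny sketch with made-up names (not my actual code): Builder acts as a compile-time function that, given a type constructor Coll and an element type A, produces the type Coll[A].

// Coll[_] is a higher-kinded type parameter: Coll is itself parameterized.
trait Builder[Coll[_]] {
  def empty[A]: Coll[A]
  def add[A](coll: Coll[A], elem: A): Coll[A]
}

// One instantiation of that type-level function: plug in List for Coll.
object ListBuilder extends Builder[List] {
  def empty[A]: List[A] = Nil
  def add[A](coll: List[A], elem: A): List[A] = elem :: coll
}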

Trait Hierarchies with Lots of Variables Can Be Slow

My first attempt at the hash array mapped trie class hierarchy looked something like this (HashMap, type parameters, and most other details omitted—and we all know what's in the details):

trait HAMT { ... }
trait EmptyNode extends HAMT { ... }
trait InternalNode extends HAMT { ... array of nodes ... }
trait Leaf1 extends HAMT { ... leaf node with one element ... }
class HashSet extends ... Scala collection stuff ...
    with EmptyNode {
  def this() = ... constructor for an empty set ...
  ...
}
class InternalNodeHashSet extends HashSet with InternalNode
class Leaf1HashSet extends HashSet with Leaf1

This struck me as a natural phrasing of the solution: each of the components of a hash trie is itself a valid hash trie (at least the way I wrote it), and also a valid hashed set or map.

I got this approach to work, and compared its speed with the standard HashSet. Alas, it was a good deal slower. Profiling led me to conclude that the problem lay in the way variable accessor methods work in traits: control flow runs through an interface, a static method, and an instance method in such a way that the JVM doesn't realize it can inline an access to an object's own instance variable.
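In source form the difference is just where the var lives; here is a simplified sketch of the two shapes (not my actual benchmark code):

// A var declared in a trait: under the 2.8-era trait encoding, code written
// in the trait reaches the field through generated accessor methods and an
// interface call into the class that mixes the trait in, and in my hierarchy
// the JVM apparently never inlined that path.
trait InternalNodeAsTrait {
  var children: Array[AnyRef] = new Array[AnyRef](32)
  def childCount: Int = children.length
}

// The same var declared in a class: reads compile to a plain field access
// on this, which the JIT inlines without trouble.
abstract class InternalNodeAsClass {
  var children: Array[AnyRef] = new Array[AnyRef](32)
  def childCount: Int = children.length
}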

Is-A vs. Has-A

So I went back to the drawing board and came up with:

abstract class HAMT { ... }
class EmptyNode extends HAMT { ... }
class InternalNode extends HAMT { ... array of nodes ... }
class Leaf1 extends HAMT { ... leaf node with one element ... }
class HashSet(hamt: HAMT)
    extends ... Scala collection stuff ... {
  def this() = ... constructor for an empty set ...
  ...
}

Now a HashSet has-a HAMT (note the parameter to HashSet's primary constructor), but is-not-a HAMT. Not quite the way I had thought of it originally, but not inconsistent.
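A hypothetical sketch of that delegation (element type fixed to Int, hashing machinery and the InternalNode case elided, names not taken from my real code):

abstract class HAMT {
  def contains(elem: Int, hash: Int, shift: Int): Boolean
  def add(elem: Int, hash: Int, shift: Int): HAMT
}

class EmptyNode extends HAMT {
  def contains(elem: Int, hash: Int, shift: Int) = false
  def add(elem: Int, hash: Int, shift: Int): HAMT = new Leaf1(elem)
}

class Leaf1(elem0: Int) extends HAMT {
  def contains(elem: Int, hash: Int, shift: Int) = elem == elem0
  def add(elem: Int, hash: Int, shift: Int): HAMT =
    if (elem == elem0) this
    else throw new UnsupportedOperationException("would grow into an InternalNode")
}

// The wrapper: a HashSet has a HAMT and delegates to it.
class HashSet(hamt: HAMT) {
  def this() = this(new EmptyNode)    // constructor for an empty set
  def contains(elem: Int): Boolean = hamt.contains(elem, elem.hashCode, 0)
  def +(elem: Int): HashSet = new HashSet(hamt.add(elem, elem.hashCode, 0))
}

Every read now pays one extra hop through the hamt field, which is exactly the indirection discussed below.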

Performance-wise, reads from this data structure run just a wee bit slower than reads from the standard HashSet. In fact, the difference is small enough that it might be accounted for entirely by the extra level of reference indirection entailed in the has-a relationship (the standard HashSet, being its own trie implementation, needs no such indirection).

Many of the variations I tried did result in faster adds and removes than the current Scala implementation, and also in more compact data structures. This is not too surprising, since the 2.8.0 scala.collection.immutable.HashSet is a preliminary implementation: comments in its source mention specialized node types for optimizing small numbers of leaves, but no such nodes actually appear in the code. (I'm not sure why modifying the official HashSet is so much slower than modifying mine. It may be because I do have several different specialized node types, each with its own custom versions of the add and remove methods.)

Public Constructors Considered Harmful

The slightly slower read time of the has-a approach irks me. I'd like to speed it up (and rescue my original intuition that a hashed set is-a HAMT) by doing this:

abstract class HAMT { ... }
class EmptyNode extends HAMT { ... }
class InternalNode extends HAMT { ... array of nodes ... }
class Leaf1 extends HAMT { ... leaf node with one element ... }
trait HashSet extends ... Scala collection stuff ... { ... }
class EmptyNodeHashSet extends HashSet with EmptyNode
class InternalNodeHashSet extends HashSet with InternalNode
class Leaf1HashSet extends HashSet with Leaf1

This looks a lot like my first hierarchy, except that HashSet is now a trait and the HAMT nodes are now classes. All the data members are inherited from classes rather than traits, so field access should be fast.

But something is broken: you can no longer say new HashSet(). HashSet is now a trait, not a class, so there is no constructor to invoke. You can package up a HashSet behind a nice factory method (as indeed Scala does for the class scala.collection.immutable.HashSet), but if the public constructor was part of the published interface, getting rid of it means removing something clients may already depend on.
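A sketch of that factory-method packaging, built on the trait-plus-node-classes hierarchy above (details still elided):

object HashSet {
  // The public way to conjure up a HashSet; the factory is free to pick
  // whichever concrete node class is appropriate.
  def empty: HashSet = new EmptyNodeHashSet
  def apply(): HashSet = empty
}

Callers write HashSet() or HashSet.empty rather than new HashSet(), and the concrete node classes never need to appear in the published interface.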

So now I finally come to the point of my post (you knew I would—or did you?): public constructors are bad. They commit a type to being implemented as a class, even though the day may come when you want that type to graduate to traithood.

I'm not the only one who's cranky about this; Gilad Bracha has a post that comes at the topic from a slightly less Java-centric point of view.

A better language than we have now could keep most of the beloved Java-style OO paraphernalia (like insisting on completely constructed objects) but refuse to allow constructors to be more visible than protected. This would force object construction by anyone who's not a subtype of the type in question to go through a factory method. And a really good language would try to generate the appropriate factory methods automatically, as Scala does for case classes.
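Case classes already show the factory-generation half in action: the compiler produces a companion-object apply (plus copy and structural equality), so callers can construct instances without ever writing new, even though the constructor itself is still public:

case class Point(x: Int, y: Int)

object CaseClassDemo {
  def main(args: Array[String]): Unit = {
    val p = Point(1, 2)      // goes through the generated Point.apply factory
    val q = p.copy(y = 3)    // copy is another generated factory-style method
    println(p == q)          // false: structural equality, also generated
  }
}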