Friday, August 6, 2010

Public Constructors Considered Harmful: Or, the Amazing Misadventures of HashSet and HashMap

A few months ago I thought I'd try my hand at an implementation of hash array mapped tries. This is the data structure used by Scala (and Clojure) for immutable hash sets and maps. The data structure provides copy-on-write semantics with reasonably fast update and, after each update, a high degree of sharing between the old and new versions of the structure.

My purposes in this adventure were twofold:
  • To see whether I could come up with variations that improve on the performance of the standard Scala implementations.
  • To see whether I could browbeat Scala's type system into letting me use a single implementation for both sets and maps.

I managed to achieve the latter goal, although I went down a number of blind alleys before I got there. This was my first project with much use of higher-kinded types, and it took me a while to wrap my mind around the idea that I was, in effect, writing compiletime functions of types that produce new types. That's a powerful idea, but not what I wanted to get to in this post.

Trait Hierarchies with Lots of Variables Can Be Slow

My first attempt at the hash array mapped trie class hierarchy looked something like this (HashMap, type parameters, and most other details omitted—and we all know what's in the details):

trait HAMT { ... }
trait EmptyNode extends HAMT { ... }
trait InternalNode extends HAMT { ... array of nodes ... }
trait Leaf1 extends HAMT { ... leaf node with one element ... }
class HashSet extends ... Scala collection stuff ...
with EmptyNode {
def this () = ... constructor for an empty set ...
...
}
class InternalNodeHashSet extends HashSet with InternalNode
class Leaf1HashSet extends HashSet with Leaf1

This struck me as a natural phrasing of the solution: each of the components of a hash trie is itself a valid hash trie (at least the way I wrote it), and also a valid hashed set or map.

I got this approach to work, and compared its speed with the standard HashSet. Alas, it was a good deal slower. Profiling led me to conclude that the problem lay in the way variable accessor methods work in traits: control flow runs through an interface, a static method, and an instance method in such a way that the JVM doesn't realize it can inline an access to an object's own instance variable.

Is-A vs. Has-A

So I went back to the drawing board and came up with:

abstract class HAMT { ... }
class EmptyNode extends HAMT { ... }
class InternalNode extends HAMT { ... array of nodes ... }
class Leaf1 extends HAMT { ... leaf node with one element ... }
class HashSet (hamt: HAMT)
extends ... Scala collection stuff ... {
def this () = ... constructor for an empty set ...
...
}

Now a HashSet has-a HAMT (note the parameter to HashSet's primary constructor), but is-not-a HAMT. Not quite the way I had thought of it originally, but not inconsistent.

Performancewise, reading this data structure runs just a wee bit slower than the standard HashSet. In fact, the difference is small enough that it might be accounted for entirely by the extra level of reference indirection entailed in the has-a relationship (the standard HashSet, having its own implementation, needs no such indirection).

Many of the variations I tried out did result in faster adds and removes than the current Scala implementation, and also in more compact data structures. This is not too surprising, since the 2.8.0 scala.collection.immutable.HashSet is a preliminary implementation. Comments in the source code say that specialized node types exist to optimize small numbers of leaves, but those nodes aren't there when you read the code. (I'm not sure why modifying the official HashSet is so slow compared to mine. It may be because I do have several different types of specialized nodes, and those nodes have custom versions of the add and remove methods.)

Public Constructors Considered Harmful

The slightly slower read time of the has-a approach irks me. I'd like to speed it up (and rescue my original intuition that a hashed set is-a HAMT) by doing this:

abstract class HAMT { ... }
class EmptyNode extends HAMT { ... }
class InternalNode extends HAMT { ... array of nodes ... }
class Leaf1 extends HAMT { ... leaf node with one element ... }
trait HashSet extends ... Scala collection stuff ... { ... }
class EmptyNodeHashSet extends HashSet with EmptyNode
class InternalNodeHashSet extends HashSet with InternalNode
class Leaf1HashSet extends HashSet with Leaf1

This looks a lot like my first hierarchy, except that HashSet is now a trait, and the HAMTs are now classes. Data members are all inherited from classes, not traits, which should be fast.

But something is broken: you can no longer say new HashSet(). HashSet is now a trait, not a class, and so you can't invoke a constructor on it. You can package up a HashSet with a nice factory method (as indeed Scala does for the class scala.collection.immutable.HashSet), but to get rid of the public constructor, you'd have to remove something from the published interface.

So now I finally come to the point of my post (you knew I would—or did you?): public constructors are bad. They commit the implementation of a type to be a class, even though the day may come when you want your type to graduate to traithood.

I'm not the only one who's cranky about this; Gilad Bracha has a post that comes at the topic from a slightly less Javacentric point of view.

A better language than we have now could keep most of the beloved Java-style OO paraphernalia (like insisting on completely constructed objects) but refuse to allow constructors to be more visible than protected. This would force object construction by anyone who's not a subtype of the type in question to go through a factory method. And a really good language would try to generate the appropriate factory methods automatically, as Scala does for case classes.

3 comments:

  1. I also encountered the class/constructor/trait problem several times in my projects: I have something that I thought was a class, and then I realize that it's more an aspect of a behavior and so better modeled as a trait, and so replace constructor by factories. So far so good.

    The problem is that now, I can no longer do:

    val a = MyClassThatIsNowATrait with OtherAspect

    Because MyClassThatIsNowATrait is only a factory for the trait, and gives an actual instance.

    That's not really a problem when there is no logic in the constructor, or not too many values in the class, because in theses cases, I can just give the implementation on the fly:

    val a = new MyClassThatIsNowATrait with OtherAspect { /*ok, trait-which-was-a-class instantiation is dead simple */ }

    It's not bearable on the other case. So I could just provide factories for all the trait mixing possible. Well, even if it was possible (because the number of combination is really small), I loose a lot of trait flexibility in the process.


    So, what I would really like to do will be to have a way to give the factory a list of trait to mix with the instance (I don't see how it could be possible), or a way to mix an actual instance with traits.

    For that last solution, I believe somebody is working in a compiler plugin which relies on runtime byte code manipulation to achieve that.

    So, do you already encounter that problem, and now solutions for it ?

    ReplyDelete
  2. Fran├žois, you've found a rather large hole in my thinking on this topic. I had not given any thought to how my idea of wrapping nonpublic constructors in public factory methods might be realized for anonymous classes (what name do you use to invoke something without a name?).

    I did discover something funny while playing with Scala 2.8. Scala traits are, of course, abstract, but it appears that they can be pairwise concrete. For example, if you define trait A and trait B, you can't say new A or new B, but you can say new A with B (which is equivalent to new java.lang.Object with A with B). Of course, there must not be any methods that are abstract in both A and B in order for this to work.

    So then, why can't a trait be concrete by itself? Then a trait would just be a class with a default constructor.

    And of course you can have code that is invoked during trait construction, as in:

    scala> trait A { println("From A") }

    scala> trait B { println("From B") }

    scala> new A with B
    From A
    From B

    ReplyDelete
  3. Hello Archontophoenix,

    Thanks for your answer, I didn't know about paired traits, and so that made me think that perhaps we could just...
    8<------------------------------------
    scala> trait A { println("From A") }
    defined trait A

    scala> new AnyRef with A
    From A
    res0: java.lang.Object with A = $anon$1@1dfa490
    8<------------------------------------

    Oh yes, it seems to work. Not sure it solves anything about my concern with factories and actual instance, but it's interesting in itself.

    ReplyDelete