Static Analysis in Clojure: Java Interoperability

I'm writing this post in anticipation of the upcoming release of version 0.1.2 of Yagni. I've written about Yagni here before, and if you're interested in this sort of thing I'd recommend reading that post first to understand the methodology behind the analyzer.

TL;DR: Yagni is a static code analyzer for Clojure that traverses your codebase to find variables and declarations that are unreachable (that is to say, will not be referenced at runtime).

Today's post will be about the subtleties of getting a static analyzer in Clojure to work well with Java interfaces and classes. I should note up front that, at least for the moment, the focus here is on Yagni's specific use case: analyzing Clojure code that includes special forms for Java interoperability. There will be no analysis of Java code.

A Primer on Java Interop in Clojure

While detailed writeups of Java interoperability in Clojure can be found in a number of places, I'm going to limit myself here to a specific subset of interoperability forms, namely: defprotocol, deftype and defrecord. These forms are used within Clojure to generate Java interfaces and classes.

defprotocol

defprotocol, as the name suggests, is used to define protocols, and generates Java interface code. An example protocol might look like this:

(defprotocol AProtocol
  "A doc string for AProtocol abstraction"
  (bar [a b] "bar docs")
  (baz [a] [a b] "baz docs"))

Any class that satisfies this protocol must implement both methods (bar and baz) with these signatures. In this case bar takes two arguments, and baz is a multi-arity function that takes either one or two arguments.

If this protocol were specified in lib.ns, the bytecode for the interface would sit on the JVM at lib.ns.AProtocol.

deftype

deftype is a Java class generation form. Philosophically, deftype is intended for use in declaring and defining classes that are unique data structures rather than simple data holders. An example, implementing the protocol we defined above, looks like this:

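(deftype Foo [a b c]
  AProtocol
  ;; the first parameter of each method is the instance itself;
  ;; a, b and c refer to the type's fields
  (bar [this x] (println a x))
  (baz [this] a)
  (baz [this x] (+ a x)))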

As with AProtocol, if defined in lib.ns the generated class bytecode here would be referenced at lib.ns.Foo.
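As a quick check of the naming described above, here is a minimal sketch. It assumes the forms are evaluated in a namespace called lib.ns (as in the running example), and the field values are made up for illustration.

```clojure
(ns lib.ns)

(defprotocol AProtocol
  "A doc string for AProtocol abstraction"
  (bar [a b] "bar docs")
  (baz [a] [a b] "baz docs"))

(deftype Foo [a b c]
  AProtocol
  (bar [this x] (println a x))
  (baz [this] a)
  (baz [this x] (+ a x)))

;; The protocol's var holds a map; the generated interface class is
;; stored under the :on-interface key.
(def protocol-interface-name (.getName (:on-interface AProtocol)))

;; The symbol Foo resolves to the generated class, not to a var.
(def foo-class-name (.getName Foo))

;; Illustrative field values only.
(def foo (Foo. 1 2 3))
(def one-arg-result (baz foo))     ; returns field a
(def two-arg-result (baz foo 10))  ; adds field a to the argument
```

Both names come back as plain dotted class names (lib.ns.AProtocol and lib.ns.Foo), which is what a walker sees when it resolves these symbols.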

defrecord

Like deftype, defrecord is a Java class generation form. However, the underlying philosophy of defrecord is a little different, and is focused instead around the notion of a custom data holder class. In this regard, defrecord forms function much like traditional Clojure maps, but with generated class bytecode. Like deftype, defrecord forms can also satisfy interfaces and have additional methods.

An example declaration (stolen from Clojure for the Brave and True):

(defprotocol WereCreature
  (full-moon-behavior [x]))

(defrecord WereWolf [name title]
  WereCreature
  (full-moon-behavior [x]
    (str name " will howl and murder")))
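To make the "map with generated class bytecode" idea concrete, here is a small sketch using the declaration above. The namespace lib.ns and the field values are assumptions chosen for illustration.

```clojure
(ns lib.ns)

(defprotocol WereCreature
  (full-moon-behavior [x]))

(defrecord WereWolf [name title]
  WereCreature
  (full-moon-behavior [x]
    (str name " will howl and murder")))

;; Illustrative values only.
(def lucian (->WereWolf "Lucian" "CEO of Melodrama"))

;; Records support keyword lookup and assoc, like maps...
(def looked-up (:name lucian))
(def promoted (assoc lucian :title "Intern"))

;; ...but they are also instances of a generated class that
;; implements the protocol's methods.
(def behavior (full-moon-behavior lucian))
(def still-a-werewolf? (instance? WereWolf promoted))
```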

The Problem Statement

Okay, now that we've covered the basics of JVM interoperability, let's talk about some of the things that our static analyzer will need to address when dealing with these special forms.

No Direct Vars for Classes

One of the most immediate problems we have is that deftype and defrecord forms don't actually intern a var for those classes in the namespace in question. That is to say: while there's JVM bytecode for lib.ns.Foo and lib.ns.WereWolf, there are no corresponding lib.ns/Foo or lib.ns/WereWolf vars. Since Yagni builds its reference graph by looking for definitions using ns-interns (which returns a map of interned vars in a namespace), the lack of vars means Yagni won't know a class exists.

defprotocol is also tricky in this case. On the one hand, there is an interned var for the protocol. However, references to protocols in deftype and defrecord forms are actually references to the class, not the var.

This creates two problems: first, how do we know if there's a deftype or defrecord form in a namespace, and second, when references to all of these forms are references to the class, how should we track them?
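The mismatch is easy to see at the REPL. The sketch below assumes everything lives in a namespace called lib.ns and simply inspects what ns-interns reports; it is not taken from Yagni's source.

```clojure
(ns lib.ns)

(defprotocol WereCreature
  (full-moon-behavior [x]))

(defrecord WereWolf [name title]
  WereCreature
  (full-moon-behavior [x]
    (str name " will howl and murder")))

;; The set of symbols that have an interned var in this namespace.
(def interned-names (set (keys (ns-interns 'lib.ns))))

;; No var is interned for the record class itself...
(def class-has-var? (contains? interned-names 'WereWolf))

;; ...but the protocol does get a var.
(def protocol-has-var? (contains? interned-names 'WereCreature))

;; The class is still reachable by name, because defrecord imports it
;; into the namespace; it just resolves to a Class rather than a var.
(def resolved (ns-resolve 'lib.ns 'WereWolf))
```

Here class-has-var? is false while protocol-has-var? is true, and resolved is a java.lang.Class.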

Generator Functions

While deftype and defrecord forms don't intern a var for themselves directly, the macroexpansion of the forms does intern generator functions in the namespace of the declaration. This means the following functions are interned in our namespace automatically: lib.ns/->Foo, lib.ns/->WereWolf and lib.ns/map->WereWolf.
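For reference, this is roughly how the generated functions behave. The sketch assumes a namespace called lib.ns and uses illustrative field values.

```clojure
(ns lib.ns)

(defrecord WereWolf [name title])

;; Both generator functions are ordinary interned vars.
(def interned-names (set (keys (ns-interns 'lib.ns))))
(def has-positional? (contains? interned-names '->WereWolf))
(def has-map-based? (contains? interned-names 'map->WereWolf))

;; ->WereWolf takes the fields positionally; map->WereWolf takes a map.
(def positional (->WereWolf "Abraham" "Lincoln"))
(def from-map (map->WereWolf {:name "Abraham" :title "Lincoln"}))

;; Either route yields an equal record.
(def same-record? (= positional from-map))
```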

We don't actually want Yagni to report on the use of these constructors directly since that would be rather noisy. For instance, if you only ever used the ->WereWolf constructor, having Yagni warn about the lack of usage of the map->WereWolf constructor isn't actually something you're likely to care about. You might not even use one of these generator functions at all, and instead use...

Class Constructors

In addition to the generator functions above, Clojure has additional special syntax for class construction. This takes one of the following two forms:

;; these do the same thing
(WereWolf. "Abraham" "Lincoln")
(new WereWolf "Abraham" "Lincoln")

The first form will macroexpand to the second...sort of. macroexpand only expands the outermost form, so the sugar is expanded when the constructor call is the top-level form, but it is left untouched when the call is nested inside another form. To give an example:

user=> (defrecord X [a])
user.X
user=> (macroexpand `(println (X. "a")))
(clojure.core/println (user.X. "a"))
user=> (macroexpand `(X. "a"))
(new user.X "a")

Unfortunately, due to the way Yagni's form walker works, we can't recursively call macroexpand each time we look at a new form. Otherwise we end up trying to macroexpand new, which throws a RuntimeException since Clojure can't resolve new. Irritating.

This means Yagni needs to recognize both WereWolf. and WereWolf as references to the lib.ns.WereWolf class.

Ultimately, our problem space looks something like this:

The Solution[s]

When one looks at the problems described above, they boil down to one meta-problem: we can't tell whether or not a given protocol or class is actually being used. In the case of protocols, the standard syntax involves a reference to the class rather than to the protocol's var. In the case of classes, there are three or four possible kinds of inbound reference: the generator functions (one for deftype, two for defrecord) and the two class constructor forms.

Identifying Classes and Extending the Graph

The path I've taken with the 0.1.2 release has been to leverage the existence of the interned generator functions as a proxy for identifying possible class definitions. Specifically, Yagni now checks to see if functions named lib.ns/->X or lib.ns/map->X can resolve to classes named lib.ns.X. When they can, Yagni adds a new node to the graph with the name lib.ns/X (where a var would exist if deftype and defrecord created self-named vars) and adds an edge from the generator function to that node.
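The check described above can be sketched as follows. This is not Yagni's actual code: generated-class and class-edges are hypothetical helper names, the namespace lib.ns is assumed, and ->celsius is a made-up function included only to show that an arrow-prefixed name with no matching class is ignored.

```clojure
(ns lib.ns
  (:require [clojure.string :as str]))

(defrecord WereWolf [name title])
(deftype Foo [a b c])

;; A regular function whose name merely looks like a generator function.
(defn ->celsius [fahrenheit]
  (* (- fahrenheit 32) 5/9))

(defn generated-class
  "Hypothetical helper: if var-sym is named ->X or map->X and X resolves
  to a class in ns-sym, return that class; otherwise return nil."
  [ns-sym var-sym]
  (let [n    (name var-sym)
        base (cond
               (str/starts-with? n "map->") (subs n 5)
               (str/starts-with? n "->")    (subs n 2))]
    (when (seq base)
      (let [target (ns-resolve ns-sym (symbol base))]
        (when (class? target)
          target)))))

(defn class-edges
  "Hypothetical helper: map each generator function in ns-sym to the
  var-like node (ns/X) that stands in for its class."
  [ns-sym]
  (into {}
        (for [var-sym (keys (ns-interns ns-sym))
              :let [cls (generated-class ns-sym var-sym)]
              :when cls]
          [(symbol (name ns-sym) (name var-sym))
           (symbol (name ns-sym) (.getSimpleName ^Class cls))])))
```

Calling (class-edges 'lib.ns) here yields edges from lib.ns/->WereWolf and lib.ns/map->WereWolf to lib.ns/WereWolf, and from lib.ns/->Foo to lib.ns/Foo.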

When Yagni encounters a reference to an interface class / protocol, it adds an edge from the form it's walking to the protocol's var rather than to the protocol's class.
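One way to get from an interface class back to the protocol's var is to compare the class against the :on-interface entry that each protocol stores in its var. The sketch below is an assumption about how this could be done, not a description of Yagni's internals; protocol-var-for is a hypothetical name and lib.ns is the assumed namespace.

```clojure
(ns lib.ns)

(defprotocol WereCreature
  (full-moon-behavior [x]))

(defrecord WereWolf [name title]
  WereCreature
  (full-moon-behavior [x]
    (str name " will howl and murder")))

(defn protocol-var-for
  "Hypothetical helper: return the var in ns-sym whose value is a
  protocol backed by the interface cls, or nil if there is none."
  [ns-sym cls]
  (some (fn [v]
          (let [value (deref v)]
            ;; A protocol var holds a map with an :on-interface key.
            (when (and (map? value)
                       (= cls (:on-interface value)))
              v)))
        (vals (ns-interns ns-sym))))

;; Walk the interfaces the record class implements and keep the ones
;; that correspond to a protocol var in lib.ns.
(def implemented-protocols
  (vec (keep #(protocol-var-for 'lib.ns %) (.getInterfaces WereWolf))))
```

In this example implemented-protocols contains only the var for WereCreature, which is the node the edge should point at.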

Similarly, while Clojure's resolve function will correctly resolve a direct reference to the WereWolf class, trying to resolve the syntactic sugar for a new WereWolf (i.e., (WereWolf.)) won't. So, as with the generator functions, now when Yagni hits a symbol that ends in a period, it checks to see if that symbol can otherwise resolve to a class. If it can, Yagni knows that it's looking at a class constructor, and adds an edge to the var-like node (the lib.ns/X mentioned above - in this case lib.ns/WereWolf) created when Yagni traversed the generator function.
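The trailing-period check might look something like this. Again, this is a sketch under assumptions rather than Yagni's implementation: constructor-class is a hypothetical name and lib.ns is the assumed namespace.

```clojure
(ns lib.ns
  (:require [clojure.string :as str]))

(defrecord WereWolf [name title])

(defn constructor-class
  "Hypothetical helper: if sym ends in a period and the rest of its name
  resolves to a class in ns-sym, return that class; otherwise nil."
  [ns-sym sym]
  (let [n (name sym)]
    (when (and (> (count n) 1)
               (str/ends-with? n "."))
      (let [bare   (symbol (subs n 0 (dec (count n))))
            target (ns-resolve ns-sym bare)]
        (when (class? target)
          target)))))

;; The sugar form is recognised as a constructor for the class.
(def from-sugar (constructor-class 'lib.ns 'WereWolf.))

;; A bare class name is not a constructor call, and a symbol that
;; resolves to a var rather than a class is ignored.
(def from-bare (constructor-class 'lib.ns 'WereWolf))
(def from-var (constructor-class 'lib.ns 'println.))
```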

Assuming we've got some entrypoint lib.otherns/somefn, Yagni's graph might now look something like this:

Compressing the Graph

The methodology described above serves us handily in the graph traversal phase of Yagni's analysis, but ends up being a little too noisy at report time. For instance, looking at that last diagram in the previous section, Yagni would warn us that we're not using any of the class's generator functions. Of course, we could simply remove the nodes for the generator functions, but then deftype and defrecord forms that had generator function references but not class constructor references would show up as parents rather than children, which is incorrect.

The solution here is to compress the graph by first changing any edges pointing to the generator functions to point directly to the nodes for the corresponding deftype or defrecord forms. We then remove the generator function nodes from the graph, leaving us with a graph that now looks like this:

Perfect.
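The compression step can be sketched on a toy graph. The representation here (a map from each node to the set of nodes it references) is an assumption made for illustration and may not match Yagni's internal data structure; compress-graph and the sample graph are likewise hypothetical.

```clojure
(defn compress-graph
  "Hypothetical helper: redirect every edge that points at a generator
  function to the corresponding class node, then drop the generator
  function nodes. generator->class maps generator nodes to class nodes."
  [graph generator->class]
  (let [redirect (fn [targets]
                   (set (map #(get generator->class % %) targets)))]
    (into {}
          (for [[node targets] graph
                :when (not (contains? generator->class node))]
            [node (redirect targets)]))))

;; A toy graph: somefn only ever uses the ->WereWolf generator function.
(def graph
  {'lib.otherns/somefn    #{'lib.ns/->WereWolf}
   'lib.ns/->WereWolf     #{'lib.ns/WereWolf}
   'lib.ns/map->WereWolf  #{'lib.ns/WereWolf}
   'lib.ns/WereWolf       #{'lib.ns/WereCreature}
   'lib.ns/WereCreature   #{}})

(def generator->class
  {'lib.ns/->WereWolf    'lib.ns/WereWolf
   'lib.ns/map->WereWolf 'lib.ns/WereWolf})

(def compressed (compress-graph graph generator->class))
```

After compression, somefn points straight at lib.ns/WereWolf and neither generator function appears as a node, so the unused map->WereWolf no longer produces a warning.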

Conclusion

Funnily enough, when I initially wrote about Yagni, I had some ideas about what the project's roadmap would be, but improving the Java interoperability wasn't on my radar at all. One of the really cool things about open source work is that sometimes your projects take you in a direction you hadn't considered (even if you should have)!

There's still quite a bit more work to be done on Yagni, but this week's release will hopefully be a major step forward for teams working in Clojure with generated Java classes.

As always, if there are additional features you'd like to see implemented, feel free to file an issue - or better yet, a pull request!

Until next time -

~ V

Discuss this post on Hacker News or on Twitter (@venantius)

Thanks to Bill Cauchois (@wcauchois) for reading an early draft of this post.