Java and the Computer Language Shootout

The Java VM is an impressive beast. In theory, features like dynamic optimization and compilation and concurrent garbage collection promise great performance gains. But in benchmarks at the Computer Language Shootout, performance numbers still appear to be lacking.

The Computer Language Shootout is good research. Complete disclosure of methods (for maximum reproducibility), an adversarial method for developing implementations and the breadth of language and compiler coverage all make for a solid data set. However I began to wonder about the fact that for each benchmark results are only published for one set of inputs. Update: As Isaac Gouy points out, there are actually three results reported, see comments below. For example for the binarytree benchmark, only one tree-depth is used, specifically 16. This value may be an arbitrary choice, it may be chosen to keep the runtimes in some range, regardless, there is only one.

For a preliminary investigation, we chose to run the binarytree benchmark in three languages, C for it’s perceived performance advantages, OCaml because it showed the best performance in the original investigation, and Java because it is important to us. Running the tests for a range of inputs was simple, given the availability of the code and the compile commands.

My data was produced using substantially the same compilation and run arguments as used in the original benchmark investigation (see notes at the bottom for the configurations I used). However the benchmarks were run on a completely different platform (Dual Intel Xeon CPU 3.06GHz, HyperThreading enabled, 2GB memory), so comparing our results to the Shootout results is meaningless.

The first thing we looked at was real (or wall-clock) time versus N, where N is essentially the depth of the binary tree used for the benchmark. Given that we expect time to grow proportional to 2N, we use a logarithmic scale for the time axis.

real.png
The main thing that stands out to us is that the C implementation is faster than Java for only very small values of N. The runtime of each implementation grows exponentially as expected.

Things get a little bit more interesting when we look at actual CPU time:

CPU Usage vs N
Notice that once again, Java fares better than C for a large range of values. Remember this is a dual-processor architecture with HyperThreading enabled, meaning a total of 4 “virtual” processors available to the benchmark. We immediately noticed that the C and OCaml implementations use only a single processor, whereas the Java implementation was able to use at least two, possibly more. This is most likely due to the fact that the Java VM was able to run the garbage collector on a different processor than the main computation.

To be fair, the additional CPU usage required by the Java implementation is quite large: 89 seconds (Java) vs 58 seconds (C) for N=20 and 180 seconds (Java) vs 127 seconds (C). Our preliminary investigation did not reveal the cause of this crossover, but there are several obvious potential culprits: interaction with the virtual memory subsystem, or reaching some threshold in the Java generational garbage collector.

Regardless, adding another dimension to the benchmark results by publishing times for a range of input values appears to shed some more light on language performance investigations.

Notes

  • gcc version 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)
  • Java HotSpot(TM) Server VM (build 1.5.0_08-b03, mixed mode)
  • ocamlopt version 3.09.2
  • Linux kernel 2.6.17-11-server

6 thoughts on “Java and the Computer Language Shootout

  1. The binary-tree benchmark from the shootout is in some ways a weird and artificial case. It’s entire purpose is to exercise the GC with lots of little allocations and deallocations, and as such, functional languages like OCaml have a big advantage. Functional languages encourage doing lots of little allocations, and their GCs are optimized accordingly.

    Even for a functional language, OCaml’s has an unusually fast GC. Runtimes like C#’s and Java’s just aren’t prepared for the kind of allocation rates you see in OCaml. Some evidence of this comes from F#, which is an OCaml-variant built for the dotNET runtime. It turns out that for allocation-heavy workloads, F# falls pretty far behind OCaml in performance. Don Syme (the author of F#, who works at Microsoft) said that he was unable at first to convince the guys who worked on the runtime to take the issue seriously — they basically saw allocation rates that high as an indication of a bug in the program.

    Another reason that OCaml’s GC is faster than Java’s is that the GC, and OCaml in general, can take advantage of only one CPU at a time. Concurrent GCs are complicated and basically never as fast as single-threaded ones. As such, OCaml wins hands-down on single-threaded performance, but if you want to take advantage of multiple CPUs, you need to create multiple OCaml processes and use message-passing to communicate between them.

    In general, one should beware of micro-benchmarks. In some ways, binary-trees is designed to show off Java at its weakest and OCaml at its strongest. (That said, I far prefer programming in OCaml to programming in Java, but that’s a story for another time…)

  2. Graham Miller wrote “However I began to wonder about the fact that for each benchmark results are only published for one set of inputs.”
    Simply wrong – we show results for 3 different input values. Every benchmark page has a “full data” link.

    Graham Miller wrote “This value may be an arbitrary choice, it may be chosen to keep the runtimes in some range…”
    It was chosen to avoid some languages failing because they used all available memory.

  3. Yaron Minsky wrote “…binary-trees is designed to show off Java at its weakest and OCaml at its strongest…”
    binary-trees was adapted in a straightforward manner from Hans Boehm’s GCBench, which was “… designed to be more representative of real applications than other Java GC benchmarks …”.

    binary-trees was absolutely not designed to “show off Java at its weakest and OCaml at its strongest”.

  4. Isaac, thanks for the comment. I have to admit that I missed the “full data” link when I was playing around with your site. Three is definitely better than one. It is still interesting that for N=16 we still get a different absolute ordering for the runtime of our three chosen languages. Here’s the relevant query on the CLS site. And obviously I would love to see this expanded to larger numbers of input sizes.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s