The Java VM is an impressive beast. In theory, features like dynamic compilation, adaptive optimization, and concurrent garbage collection promise substantial performance gains. But in the benchmarks at the Computer Language Shootout, Java's numbers still appear to be lacking.
The Computer Language Shootout is good research. Complete disclosure of methods (for maximum reproducibility), an adversarial method for developing implementations, and the breadth of language and compiler coverage all make for a solid data set. However, I began to wonder about the fact that, for each benchmark, results are published for only one set of inputs. Update: As Isaac Gouy points out, there are actually three results reported; see comments below. For example, for the binarytree benchmark only one tree depth is used, specifically 16. This value may be an arbitrary choice, or it may have been chosen to keep the runtimes in some range; regardless, there is only one.
For a preliminary investigation, we chose to run the binarytree benchmark in three languages: C for its perceived performance advantages, OCaml because it showed the best performance in the original investigation, and Java because it is important to us. Running the tests for a range of inputs was simple, given the availability of the code and the compile commands.
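For readers unfamiliar with the benchmark, its core is easy to sketch. The following is a hypothetical, simplified Java rendition (the class and method names are ours, and the real Shootout program allocates and checks many trees of varying depths, not just one): it allocates a complete binary tree of a given depth and then walks it, which is what stresses the allocator and garbage collector.

```java
// Simplified sketch of the binarytrees benchmark core (not the Shootout source).
public class BinaryTreesSketch {
    static final class Node {
        final Node left, right;
        Node(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Allocate a complete binary tree of the given depth.
    static Node build(int depth) {
        if (depth == 0) return new Node(null, null);
        return new Node(build(depth - 1), build(depth - 1));
    }

    // Count nodes by walking the tree, touching every allocation.
    static int check(Node n) {
        if (n.left == null) return 1;
        return 1 + check(n.left) + check(n.right);
    }

    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 16;
        System.out.println("nodes: " + check(build(n)));
    }
}
```

A tree of depth N contains 2^(N+1) - 1 nodes, so each increment of N roughly doubles the allocation work.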
My data was produced using substantially the same compilation and run arguments as those used in the original benchmark investigation (see the notes at the bottom for the configurations I used). However, the benchmarks were run on a completely different platform (dual Intel Xeon CPU, 3.06 GHz, HyperThreading enabled, 2 GB memory), so comparing our results directly to the Shootout results is meaningless.
The first thing we looked at was real (or wall-clock) time versus N, where N is essentially the depth of the binary tree used for the benchmark. Since a complete binary tree of depth N contains 2^(N+1) - 1 nodes, we expect time to grow roughly in proportion to 2^N, so we use a logarithmic scale for the time axis.
Things get a little bit more interesting when we look at actual CPU time:
Notice that once again, Java fares better than C for a large range of values. Remember this is a dual-processor architecture with HyperThreading enabled, meaning a total of 4 “virtual” processors available to the benchmark. We immediately noticed that the C and OCaml implementations use only a single processor, whereas the Java implementation was able to use at least two, possibly more. This is most likely due to the fact that the Java VM was able to run the garbage collector on a different processor than the main computation.
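The gap between wall-clock time and CPU time is what exposes this parallelism: tools like time(1) report CPU time summed over all of a process's threads, including the collector's, so concurrent GC can push reported CPU time above wall-clock time. The sketch below (the class name and the allocation-churning workload are our inventions, not part of the benchmark) contrasts wall time with the main thread's own CPU time from inside the JVM; the difference between process-level CPU time and the main thread's CPU time approximates the work done by other threads, such as the collector's.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch: wall-clock time vs. the main thread's CPU time.
public class CpuVsWall {
    // Returns {wallNanos, mainThreadCpuNanos} for an allocation-heavy loop.
    static long[] measure() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = bean.isCurrentThreadCpuTimeSupported();
        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? bean.getCurrentThreadCpuTime() : 0;
        long sum = 0;
        // Churn through short-lived allocations to give the collector work.
        for (int i = 0; i < 2000000; i++) {
            sum += new int[16].length;
        }
        long wall = System.nanoTime() - wallStart;
        long cpu = cpuSupported ? bean.getCurrentThreadCpuTime() - cpuStart : 0;
        if (sum < 0) throw new AssertionError(); // keep the loop live
        return new long[] { wall, cpu };
    }

    public static void main(String[] args) {
        long[] r = measure();
        System.out.printf("wall=%.1fms mainThreadCpu=%.1fms%n",
                r[0] / 1e6, r[1] / 1e6);
    }
}
```

On a VM with GC running on other threads, the process CPU time reported by time(1) would exceed the main-thread CPU time measured here, whereas a single-threaded C or OCaml binary shows CPU time at or below wall time.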
To be fair, the additional CPU usage required by the Java implementation is quite large: 89 seconds (Java) vs. 58 seconds (C) for N=20, and 180 seconds (Java) vs. 127 seconds (C). Our preliminary investigation did not reveal the cause of this crossover, but there are several obvious potential culprits: interaction with the virtual-memory subsystem, or reaching some threshold in the Java generational garbage collector.
Regardless, adding another dimension to the benchmark results by publishing times for a range of input values appears to shed some more light on language performance investigations.
- gcc version 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)
- Java HotSpot(TM) Server VM (build 1.5.0_08-b03, mixed mode)
- ocamlopt version 3.09.2
- Linux kernel 2.6.17-11-server