This post is a follow-up to my previous post about copying objects in Java. After I published that post, I got a question from Sven Reimers (@SvenNB): why is there such a big performance difference between clone and copying via a constructor? In this post I will try to answer that question.
Code in question
Just to recap what we are looking at. There are 2 classes implementing
- Copy via clone():
- Copy via constructor:
Both of these classes inherit from the common base class that defines state to be copied:
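As a rough sketch of the two approaches being compared: the class names (Base, CloneCopy, ConstructorCopy) and the fields below are assumptions made for illustration, not the listings from the previous post.

```java
// Hypothetical base class holding the state to be copied (field names are illustrative).
abstract class Base {
    protected int id;
    protected String name;
    protected long[] values;
}

// Variant 1: copy via Object.clone().
final class CloneCopy extends Base implements Cloneable {
    CloneCopy(int id, String name, long[] values) {
        this.id = id;
        this.name = name;
        this.values = values;
    }

    CloneCopy copy() {
        try {
            // Object.clone() performs a field-by-field (shallow) copy.
            return (CloneCopy) super.clone();
        } catch (CloneNotSupportedException e) {
            // Cannot happen because this class implements Cloneable.
            throw new AssertionError(e);
        }
    }
}

// Variant 2: copy via a copy constructor.
final class ConstructorCopy extends Base {
    ConstructorCopy(int id, String name, long[] values) {
        this.id = id;
        this.name = name;
        this.values = values;
    }

    // Copy constructor: assigns every field explicitly (also a shallow copy).
    ConstructorCopy(ConstructorCopy other) {
        this.id = other.id;
        this.name = other.name;
        this.values = other.values;
    }

    ConstructorCopy copy() {
        return new ConstructorCopy(this);
    }
}
```

Both variants in this sketch produce shallow copies, so the values array is shared between the original and the copy; only the way the new object gets populated differs.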
Clone under the hood
java.lang.Object declares the clone() method as native, which gives the JVM the opportunity to use intrinsics. In fact, this is what the OpenJDK JVM implementation does under the hood:
Unfortunately, I was not able to find out exactly what such an intrinsified clone() call looks like. If any of you know the answer, I would be more than happy to hear about it!
Test code and results
This time I won’t be using JMH to run my tests, because I just need to force the JVM to compile the methods in question. For each case there is a dedicated test class (e.g. TestConstructor.java) that invokes the copy() method 500 000 times during the warmup phase and then another 10 000 000 times during the actual test phase. These numbers are not particularly relevant; they were chosen to ensure that the JVM compiles the copy methods into native code.
I will use JDK 1.7.0_45.
Here are test classes:
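A minimal sketch of what such a test class might look like, assuming a warmup phase followed by a measured phase with the iteration counts mentioned above. The class CopyLoopSketch, the nested Point class, and the checksum are hypothetical stand-ins, not the original test code.

```java
// Hypothetical harness: warm up so the JIT compiles the copy path, then run the test phase.
final class CopyLoopSketch {

    // Stand-in for the class under test (illustrative only).
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        Point(Point other) {
            this(other.x, other.y);
        }

        Point copy() {
            return new Point(this);
        }
    }

    static final int WARMUP_ITERATIONS = 500_000;
    static final int TEST_ITERATIONS = 10_000_000;

    // Small wrapper around copy(), mirroring the callCopy() method named in the text.
    static Point callCopy(Point p) {
        return p.copy();
    }

    // Calls callCopy() in a loop and returns a checksum so the loop has an observable result.
    static long test(Point p, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += callCopy(p).x;
        }
        return checksum;
    }

    public static void main(String[] args) {
        Point p = new Point(1, 2);
        test(p, WARMUP_ITERATIONS); // warmup phase

        long start = System.nanoTime();
        long checksum = test(p, TEST_ITERATIONS); // actual test phase
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum=" + checksum + ", elapsed=" + elapsedMs + " ms");
    }
}
```

A hand-rolled loop like this is only good enough for triggering compilation and inspecting the generated code; the timing it prints should not be treated as a careful benchmark.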
I ran both tests with the -XX:+PrintCompilation option and got the following results:
This by itself does not tell us much, except that in the second (constructor) case two more entries were compiled (i.e.
The real fun is in looking at the generated assembly code (i.e. running with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly). If you want to know how to dump the assembly, you can use the java-print-assembly instructions provided by Nitsan Wakart on his blog. The links to the proper binaries saved me tons of time today. 😉
I will show only the relevant parts of the assembly dumps here, namely the compiled callCopy() method from each test class, because that is the one we are interested in:
As you can see, the clone case has much shorter assembly code, and it is basically just an *invokespecial clone invocation, whereas in the constructor case we see a much bigger assembly output that in essence contains multiple
Eventually I managed to compile Intel Performance Counter Monitor 2.5.1 on my OS X 10.9.
Here are the results of running clone and constructor code under PCM (NB: I changed number of iterations to
test() method for this run):
What this output shows is that the clone case is faster because the number of instructions executed is lower, i.e. there is less code to execute:
Clone: Instructions retired: 10 G ; Active cycles: 16 G ; Time (TSC): 3425 Mticks
Constructor: Instructions retired: 11 G ; Active cycles: 17 G ; Time (TSC): 3988 Mticks
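To put these counters in proportion, here is a small calculation of the relative differences. It assumes that the first line of figures belongs to the clone run and the second to the constructor run (consistent with the lower instruction count), and it works with the rounded values printed by PCM, so the percentages are only approximate. The class and method names are made up for this sketch.

```java
// Hypothetical helper: relative differences between the two PCM result lines.
final class PcmRatios {

    // Returns by how many percent 'other' exceeds 'baseline'.
    static double percentMore(double baseline, double other) {
        return (other / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Rounded figures from the PCM output: clone run as baseline, constructor run as other.
        System.out.printf("instructions retired: +%.1f%%%n", percentMore(10, 11));
        System.out.printf("active cycles:        +%.1f%%%n", percentMore(16, 17));
        System.out.printf("time (TSC):           +%.1f%%%n", percentMore(3425, 3988));
    }
}
```

With these rounded inputs, the constructor run retires roughly 10% more instructions, spends about 6% more active cycles, and takes about 16% more TSC ticks than the clone run.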