In the following example, one thread is sending "messages" via a ByteBuffer which is the consumer is taking. The best performance is very good but its not consistent.
public class Main {
public static void main(String... args) throws IOException {
for (int i = 0; i < 10; i++)
doTest();
}
public static void doTest() {
final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
final ByteBuffer readBuffer = writeBuffer.slice();
final AtomicInteger readCount = new PaddedAtomicInteger();
final AtomicInteger writeCount = new PaddedAtomicInteger();
for(int i=0;i<3;i++)
performTiming(writeBuffer, readBuffer, readCount, writeCount);
System.out.println();
}
private static void performTiming(ByteBuffer writeBuffer, final ByteBuffer readBuffer, final AtomicInteger readCount, final AtomicInteger writeCount) {
writeBuffer.clear();
readBuffer.clear();
readCount.set(0);
writeCount.set(0);
Thread t = new Thread(new Runnable() {
@Override
public void run() {
byte[] bytes = new byte[128];
while (!Thread.interrupted()) {
int rc = readCount.get(), toRead;
while ((toRead = writeCount.get() - rc) <= 0) ;
for (int i = 0; i < toRead; i++) {
byte len = readBuffer.get();
if (len == -1) {
// rewind.
readBuffer.clear();
// rc++;
} else {
int num = readBuffer.getInt();
if (num != rc)
throw new AssertionError("Expected " + rc + " but got " + num) ;
rc++;
readBuffer.get(bytes, 0, len - 4);
}
}
readCount.lazySet(rc);
}
}
});
t.setDaemon(true);
t.start();
Thread.yield();
long start = System.nanoTime();
int runs = 30 * 1000 * 1000;
int len = 32;
byte[] bytes = new byte[len - 4];
int wc = writeCount.get();
for (int i = 0; i < runs; i++) {
if (writeBuffer.remaining() < len + 1) {
// reader has to catch up.
while (wc - readCount.get() > 0) ;
// rewind.
writeBuffer.put((byte) -1);
writeBuffer.clear();
}
writeBuffer.put((byte) len);
writeBuffer.putInt(i);
writeBuffer.put(bytes);
writeCount.lazySet(++wc);
}
// reader has to catch up.
while (wc - readCount.get() > 0) ;
t.interrupt();
t.stop();
long time = System.nanoTime() - start;
System.out.printf("Message rate was %.1f M/s offsets %d %d %d%n", runs * 1e3 / time
, addressOf(readBuffer) - addressOf(writeBuffer)
, addressOf(readCount) - addressOf(writeBuffer)
, addressOf(writeCount) - addressOf(writeBuffer)
);
}
// assumes -XX:+UseCompressedOops.
public static long addressOf(Object... o) {
long offset = UNSAFE.arrayBaseOffset(o.getClass());
return UNSAFE.getInt(o, offset) * 8L;
}
public static final Unsafe UNSAFE = getUnsafe();
public static Unsafe getUnsafe() {
try {
Field field = Unsafe.class.getDeclaredField("theUnsafe");
field.setAccessible(true);
return (Unsafe) field.get(null);
} catch (Exception e) {
throw new AssertionError(e);
}
}
private static class PaddedAtomicInteger extends AtomicInteger {
public long p2, p3, p4, p5, p6, p7;
public long sum() {
// return 0;
return p2 + p3 + p4 + p5 + p6 + p7;
}
}
}
prints the timings for the same block of data. The numbers at the end are the relative addresses of the objects which show they are layed out in cache the same each time. Running longer tests of 10 shows that a given combination produces the same performance repeatedly.
Message rate was 63.2 M/s offsets 136 200 264
Message rate was 80.4 M/s offsets 136 200 264
Message rate was 80.0 M/s offsets 136 200 264
Message rate was 81.9 M/s offsets 136 200 264
Message rate was 82.2 M/s offsets 136 200 264
Message rate was 82.5 M/s offsets 136 200 264
Message rate was 79.1 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264
Message rate was 34.7 M/s offsets 136 200 264
Message rate was 39.1 M/s offsets 136 200 264
Message rate was 39.0 M/s offsets 136 200 264
Each set of buffers and counter are tested three times and those buffers appear to give similar results. SO I believe there is something about the way these buffers are layed out in memory I am not seeing.
Is there anything which might give the higher performance more often? It looks like a cache collision but I can't see where this could be happening.
BTW: M/s
is millions of messages per second and is more than anyone is likely to need, but it would be good to understand how to make it consistently fast.
EDIT: Using synchronized with wait and notify makes the result much more consistent. But not faster.
Message rate was 6.9 M/s
Message rate was 7.8 M/s
Message rate was 7.9 M/s
Message rate was 6.7 M/s
Message rate was 7.5 M/s
Message rate was 7.7 M/s
Message rate was 7.3 M/s
Message rate was 7.9 M/s
Message rate was 6.4 M/s
Message rate was 7.8 M/s
EDIT: By using task set, I can make the performance consistent if I lock the two threads to change the same core.
Message rate was 35.1 M/s offsets 136 200 216
Message rate was 34.0 M/s offsets 136 200 216
Message rate was 35.4 M/s offsets 136 200 216
Message rate was 35.6 M/s offsets 136 200 216
Message rate was 37.0 M/s offsets 136 200 216
Message rate was 37.2 M/s offsets 136 200 216
Message rate was 37.1 M/s offsets 136 200 216
Message rate was 35.0 M/s offsets 136 200 216
Message rate was 37.1 M/s offsets 136 200 216
If I use any two logical threads on different cores, I get the inconsistent behaviour
Message rate was 60.2 M/s offsets 136 200 216
Message rate was 68.7 M/s offsets 136 200 216
Message rate was 55.3 M/s offsets 136 200 216
Message rate was 39.2 M/s offsets 136 200 216
Message rate was 39.1 M/s offsets 136 200 216
Message rate was 37.5 M/s offsets 136 200 216
Message rate was 75.3 M/s offsets 136 200 216
Message rate was 73.8 M/s offsets 136 200 216
Message rate was 66.8 M/s offsets 136 200 216
EDIT: It appears that triggering a GC will shift the behaviour. These show repeated test on the same buffer+counters with a manually trigger GC halfway.
faster after GC
Message rate was 27.4 M/s offsets 136 200 216
Message rate was 27.8 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
Message rate was 27.7 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
[GC 14312K->1518K(244544K), 0.0003050 secs]
[Full GC 1518K->1328K(244544K), 0.0068270 secs]
Message rate was 34.7 M/s offsets 64 128 144
Message rate was 54.5 M/s offsets 64 128 144
Message rate was 54.1 M/s offsets 64 128 144
Message rate was 51.9 M/s offsets 64 128 144
Message rate was 57.2 M/s offsets 64 128 144
and slower
Message rate was 61.1 M/s offsets 136 200 216
Message rate was 61.8 M/s offsets 136 200 216
Message rate was 60.5 M/s offsets 136 200 216
Message rate was 61.1 M/s offsets 136 200 216
[GC 35740K->1440K(244544K), 0.0018170 secs]
[Full GC 1440K->1302K(244544K), 0.0071290 secs]
Message rate was 53.9 M/s offsets 64 128 144
Message rate was 54.3 M/s offsets 64 128 144
Message rate was 50.8 M/s offsets 64 128 144
Message rate was 56.6 M/s offsets 64 128 144
Message rate was 56.0 M/s offsets 64 128 144
Message rate was 53.6 M/s offsets 64 128 144
EDIT: Using @BegemoT's library to print the core id used I get the following on a 3.8 GHz i7 (home PC)
Note: the offsets are incorrect by a factor of 8. As the heap size was small, the JVM doesn't multiply the reference by 8 like it does with a heap which is larger (but less than 32 GB).
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 54.4 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#6]
Message rate was 54.2 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.5 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.9 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 26.0 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.0 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.8 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 1088 1600 2112
You can see that the same logical threads are being used, but the performance varies, between runs, but not within a run (within a run the same objects are used)
I have found the problem. It was a memory layout issue but I could see a simple way to resolve it. ByteBuffer cannot be extended so you can't add padding so I create an object I discard.
final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
final ByteBuffer readBuffer = writeBuffer.slice();
new PaddedAtomicInteger();
final AtomicInteger readCount = new PaddedAtomicInteger();
final AtomicInteger writeCount = new PaddedAtomicInteger();
Without this extra padding (of the object which is not used), the results look like this on a 3.8 GHz i7.
Message rate was 38.5 M/s offsets 3392 3904 4416
Message rate was 54.7 M/s offsets 3392 3904 4416
Message rate was 59.4 M/s offsets 3392 3904 4416
Message rate was 54.3 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 28.0 M/s offsets 1088 1600 2112
Message rate was 28.1 M/s offsets 1088 1600 2112
Message rate was 28.0 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 54.5 M/s offsets 1088 1600 2112
Message rate was 54.2 M/s offsets 1088 1600 2112
Message rate was 55.1 M/s offsets 1088 1600 2112
Message rate was 25.5 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 54.7 M/s offsets 1088 1600 2112
Message rate was 54.4 M/s offsets 1088 1600 2112
Message rate was 57.0 M/s offsets 1088 1600 2112
Message rate was 55.9 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112
Message rate was 51.4 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 56.1 M/s offsets 1088 1600 2112
Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 47.4 M/s offsets 1088 1600 2112
with the discarded padded object.
Message rate was 54.3 M/s offsets 3392 4416 4928
Message rate was 53.1 M/s offsets 3392 4416 4928
Message rate was 59.2 M/s offsets 3392 4416 4928
Message rate was 58.8 M/s offsets 1088 2112 2624
Message rate was 58.9 M/s offsets 1088 2112 2624
Message rate was 59.3 M/s offsets 1088 2112 2624
Message rate was 59.4 M/s offsets 1088 2112 2624
Message rate was 59.0 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.2 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.6 M/s offsets 1088 2112 2624
Message rate was 59.6 M/s offsets 1088 2112 2624
Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 61.6 M/s offsets 1088 2112 2624
Message rate was 60.8 M/s offsets 1088 2112 2624
Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 58.3 M/s offsets 1088 2112 2624
Unfortunately there is always the risk that after a GC, the objects will not be laid out optimally. The only way to resolve this may be to add padding to the original class. :(
I'm not an expert in the area of processor caches but I suspect your issue is essentially a cache issue or some other memory layout problem. Repeated allocation of the buffers and counters without cleaning up the old objects may be causing you to periodically get a very bad cache layout, which may lead to your inconsistent performance.
ReplyDeleteUsing your code and making a few mods I have been able to make the performance consistent (my test machine is Intel Core2 Quad CPU Q6600 2.4GHz w/ Win7x64 - so not quite the same but hopefully close enough to have relevant results). I've done this in two different ways both of which have roughly the same effect.
First, move creation of the buffers and counters outside of the doTest method so that they are created only once and then reused for each pass of the test. Now you get the one allocation, it sits nicely in the cache and performance is consistent.
Another way to get the same reuse but with "different" buffers/counters was to insert a gc after the performTiming loop:
for ( int i = 0; i < 3; i++ )
performTiming ( writeBuffer, readBuffer, readCount, writeCount );
System.out.println ();
System.gc ();
Here the result is more or less the same - the gc lets the buffers/counters be reclaimed and the next allocation ends up reusing the same memory (at least on my test system) and you end up in cache with consistent performance (I also added printing of the actual addresses to verify reuse of the same locations). My guess is that without the clean up leading to reuse you eventually end up with a buffer allocated that doesn't fit into the cache and your performance suffers as it is swapped in. I suspect that you could do some strange things with order of allocation (like you can make the performance worse on my machine by moving the counter allocation in front of the buffers) or creating some dead space around each run to "purge" the cache if you didn't want to eliminate the buffers from a prior loop.
Finally, as I said, processor cache and the fun of memory layouts aren't my area of expertise so if the explanations are misleading or wrong - sorry about that.
you are busy waiting. that is always a bad idea in user code.
ReplyDeletereader:
while ((toRead = writeCount.get() - rc) <= 0) ;
writer:
while (wc - readCount.get() > 0) ;
As a general approach to performance analysis:
ReplyDeleteTry jconsole. Start your app, and while it's running type jconsole in separate terminal window. This will bring up the Java Console GUI, which allows you to connect to a running JVM, and see performance metrics, memory usage, Thread count and status, etc.
Basically you're going to have to figure out the correlation between the speed fluxuations and what you see the JVM doing. It could also be helpful to bring up your task manager and see if your system is actually just busy doing other stuff (paging to the disk due to low memory, busy with a heavy background task, etc.) and put it side-by-side with the jconsole window.
One other alternative is launching the JVM with the -Xprof option which outputs relative time spent in various methods on a per-thread basis. Ex. java -Xprof [your class file]
Finally, there is also JProfiler, but it's a commercial tool, if that matters to you.
EDIT: It appears that triggering a GC will shift the behaviour. These
ReplyDeleteshow repeated test on the same buffer+counters with a manually trigger
GC halfway.
GC means reaching a safepoint which means all threads have stopped executing bytecode & the GC threads have work to do. This can have various side effects. For example, in the absence of any explicit cpu affinity, you may restart execution on a different core or cache lines may have been refreshed. Can you track which cores your threads are running on?
What CPUs are these? Have you done anything about power management to prevent them dropping down into lower p and/or c states? Perhaps 1 thread is being scheduled onto a core that was in a different p state hence shows a different performance profile.
EDIT
I tried running your test on a workstation running x64 linux with 2 slightly old quadcore xeons (E5504), it's generally consistent within a run (~17-18M/s) with occasion runs much slower which appears to generally correspond with thread migrations. I didn't plot this rigorously. Therefore it appears your problem might be CPU architecture specific. You mention you're running an i7 at 4.6GHz, is that a typo? I thought the i7 topped out at 3.5GHz with a 3.9Ghz turbo mode (with an earlier version 3.3GHz to 3.6GHz turbo). Either way are you sure you're not seeing an artifact of turbo mode kicking in then dropping out? You could try repeating the test with turbo disabled to be sure.
A couple of other points
the padding values are all 0, are you sure there isn't some special treatment being meted out to uninitialised values? you could consider using the LogCompilation option to understand how the JIT is treating that method
Intel VTune is free for 30 day evaluation, if this is a cache line problem then you could use that to determine what the problem is on your host
How do you actually pin your threads to cores? taskset is not the best way to pin thread to cores, since it just pin process to cores -- and all its threads will share this cores. Recall, java have many internal threads for it's own needs, so all them will contend on cores you'll bind them to.
ReplyDeleteTo have more consistent results you can use JNA to call sched_setaffinity() from only threads you need to. It will pin only your benchmarking threads to exact cores, while other java threads will spread on other free cores, having less influence on your code behavior.
By the way, I've have similar issues with unstable performance while benchmarking highly optimized concurrent code. It seems, like where are too many things which can influence performance drastically while it is close to hardware limits. You should tune your OS somehow, to give your code the possibility to make it best, or just use many experiments and use math to have averages and confidence intervals.
There would certainly be some inconsistency brought in when a full GC runs, but that's not so often. Try modifying the stack size (Xss) to say 32M and see if that helps. Also, try clearing the 2 buffers at the end of each test to make it even easier for the GC to know that the contents can be collected. Interestingly, you have used thread.stop() which is deprecated and absolutely not recommended. I would suggest changing that too.
ReplyDelete