
Ways to improve performance consistency



In the following example, one thread sends "messages" via a ByteBuffer and a consumer thread takes them. The best performance is very good, but it is not consistent.







import sun.misc.Unsafe;

import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

public class Main {
    public static void main(String... args) throws IOException {
        for (int i = 0; i < 10; i++)
            doTest();
    }

    public static void doTest() {
        final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
        final ByteBuffer readBuffer = writeBuffer.slice();
        final AtomicInteger readCount = new PaddedAtomicInteger();
        final AtomicInteger writeCount = new PaddedAtomicInteger();

        for (int i = 0; i < 3; i++)
            performTiming(writeBuffer, readBuffer, readCount, writeCount);
        System.out.println();
    }

    private static void performTiming(ByteBuffer writeBuffer, final ByteBuffer readBuffer,
                                      final AtomicInteger readCount, final AtomicInteger writeCount) {
        writeBuffer.clear();
        readBuffer.clear();
        readCount.set(0);
        writeCount.set(0);

        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                byte[] bytes = new byte[128];
                while (!Thread.interrupted()) {
                    int rc = readCount.get(), toRead;
                    // busy-wait until the writer has published more messages.
                    while ((toRead = writeCount.get() - rc) <= 0) ;
                    for (int i = 0; i < toRead; i++) {
                        byte len = readBuffer.get();
                        if (len == -1) {
                            // rewind.
                            readBuffer.clear();
                            // rc++;
                        } else {
                            int num = readBuffer.getInt();
                            if (num != rc)
                                throw new AssertionError("Expected " + rc + " but got " + num);
                            rc++;
                            readBuffer.get(bytes, 0, len - 4);
                        }
                    }
                    readCount.lazySet(rc);
                }
            }
        });
        t.setDaemon(true);
        t.start();
        Thread.yield();

        long start = System.nanoTime();
        int runs = 30 * 1000 * 1000;
        int len = 32;
        byte[] bytes = new byte[len - 4];
        int wc = writeCount.get();
        for (int i = 0; i < runs; i++) {
            if (writeBuffer.remaining() < len + 1) {
                // reader has to catch up.
                while (wc - readCount.get() > 0) ;
                // rewind.
                writeBuffer.put((byte) -1);
                writeBuffer.clear();
            }
            writeBuffer.put((byte) len);
            writeBuffer.putInt(i);
            writeBuffer.put(bytes);
            writeCount.lazySet(++wc);
        }
        // reader has to catch up.
        while (wc - readCount.get() > 0) ;
        t.interrupt();
        t.stop();
        long time = System.nanoTime() - start;
        System.out.printf("Message rate was %.1f M/s offsets %d %d %d%n", runs * 1e3 / time
                , addressOf(readBuffer) - addressOf(writeBuffer)
                , addressOf(readCount) - addressOf(writeBuffer)
                , addressOf(writeCount) - addressOf(writeBuffer)
        );
    }

    // assumes -XX:+UseCompressedOops.
    public static long addressOf(Object... o) {
        long offset = UNSAFE.arrayBaseOffset(o.getClass());
        return UNSAFE.getInt(o, offset) * 8L;
    }

    public static final Unsafe UNSAFE = getUnsafe();

    public static Unsafe getUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }

    private static class PaddedAtomicInteger extends AtomicInteger {
        public long p2, p3, p4, p5, p6, p7;

        public long sum() {
            // return 0;
            return p2 + p3 + p4 + p5 + p6 + p7;
        }
    }
}







This prints the timings for the same block of data. The numbers at the end are the relative addresses of the objects, which show that they are laid out in memory, and thus in cache, the same way each time. Running longer tests than the 10 shown here shows that a given combination produces the same performance repeatedly.







Message rate was 63.2 M/s offsets 136 200 264
Message rate was 80.4 M/s offsets 136 200 264
Message rate was 80.0 M/s offsets 136 200 264

Message rate was 81.9 M/s offsets 136 200 264
Message rate was 82.2 M/s offsets 136 200 264
Message rate was 82.5 M/s offsets 136 200 264

Message rate was 79.1 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264

Message rate was 34.7 M/s offsets 136 200 264
Message rate was 39.1 M/s offsets 136 200 264
Message rate was 39.0 M/s offsets 136 200 264







Each set of buffers and counters is tested three times, and a given set appears to give similar results each time, so I believe there is something about the way these buffers are laid out in memory that I am not seeing.





Is there anything which might give the higher performance more often? It looks like a cache collision, but I can't see where this could be happening.
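
If it were a classic set collision, the relative addresses the test already prints could be used to check it. Here is a sketch of such a check (mine, not part of the test), assuming 64-byte cache lines and 64 sets, as in a typical 32 KB 8-way L1; the constants vary by CPU.

// Which L1 set an address maps to, assuming 64-byte lines and 64 sets.
static int cacheSetOf(long address) {
    return (int) ((address >> 6) & 63);
}

// Two objects can only collide in the same set, e.g.
// cacheSetOf(addressOf(readCount)) == cacheSetOf(addressOf(writeCount))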





BTW: M/s means millions of messages per second. This is more than anyone is likely to need, but it would be good to understand how to make it consistently fast.








EDIT: Using synchronized with wait and notify makes the result much more consistent, but not faster.
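
For reference, here is a minimal, self-contained sketch (my own illustration, not the code that produced the numbers below) of the synchronized/wait/notify style of handoff: the reader blocks on a shared lock instead of spinning, and the writer notifies after publishing. The class name and message count are made up.

import java.util.concurrent.atomic.AtomicInteger;

public class WaitNotifySketch {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        final AtomicInteger writeCount = new AtomicInteger();
        final int messages = 1000 * 1000;

        Thread reader = new Thread(new Runnable() {
            @Override
            public void run() {
                int rc = 0;
                try {
                    while (rc < messages) {
                        synchronized (lock) {
                            // block until the writer has published more, instead of spinning.
                            while (writeCount.get() - rc <= 0)
                                lock.wait();
                        }
                        rc = writeCount.get(); // consume everything published so far.
                    }
                } catch (InterruptedException ignored) {
                }
            }
        });
        reader.start();

        long start = System.nanoTime();
        for (int i = 1; i <= messages; i++) {
            writeCount.set(i);
            synchronized (lock) {
                lock.notifyAll(); // wake the reader after publishing.
            }
        }
        reader.join();
        long time = System.nanoTime() - start;
        System.out.printf("Message rate was %.1f M/s%n", messages * 1e3 / time);
    }
}

Each message now pays for monitor acquisition and a notify rather than a spin on a shared counter, which fits the steadier but much lower rates below.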







Message rate was 6.9 M/s
Message rate was 7.8 M/s
Message rate was 7.9 M/s
Message rate was 6.7 M/s
Message rate was 7.5 M/s
Message rate was 7.7 M/s
Message rate was 7.3 M/s
Message rate was 7.9 M/s
Message rate was 6.4 M/s
Message rate was 7.8 M/s










EDIT: By using taskset, I can make the performance consistent if I lock the two threads to the same core.







Message rate was 35.1 M/s offsets 136 200 216
Message rate was 34.0 M/s offsets 136 200 216
Message rate was 35.4 M/s offsets 136 200 216

Message rate was 35.6 M/s offsets 136 200 216
Message rate was 37.0 M/s offsets 136 200 216
Message rate was 37.2 M/s offsets 136 200 216

Message rate was 37.1 M/s offsets 136 200 216
Message rate was 35.0 M/s offsets 136 200 216
Message rate was 37.1 M/s offsets 136 200 216



If I use any two logical threads on different cores, I get the inconsistent behaviour:



Message rate was 60.2 M/s offsets 136 200 216
Message rate was 68.7 M/s offsets 136 200 216
Message rate was 55.3 M/s offsets 136 200 216

Message rate was 39.2 M/s offsets 136 200 216
Message rate was 39.1 M/s offsets 136 200 216
Message rate was 37.5 M/s offsets 136 200 216

Message rate was 75.3 M/s offsets 136 200 216
Message rate was 73.8 M/s offsets 136 200 216
Message rate was 66.8 M/s offsets 136 200 216










EDIT: It appears that triggering a GC will shift the behaviour. These show repeated tests on the same buffers and counters with a manually triggered GC halfway through.







faster after GC



Message rate was 27.4 M/s offsets 136 200 216
Message rate was 27.8 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
Message rate was 27.7 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
[GC 14312K->1518K(244544K), 0.0003050 secs]
[Full GC 1518K->1328K(244544K), 0.0068270 secs]
Message rate was 34.7 M/s offsets 64 128 144
Message rate was 54.5 M/s offsets 64 128 144
Message rate was 54.1 M/s offsets 64 128 144
Message rate was 51.9 M/s offsets 64 128 144
Message rate was 57.2 M/s offsets 64 128 144



and slower



Message rate was 61.1 M/s offsets 136 200 216
Message rate was 61.8 M/s offsets 136 200 216
Message rate was 60.5 M/s offsets 136 200 216
Message rate was 61.1 M/s offsets 136 200 216
[GC 35740K->1440K(244544K), 0.0018170 secs]
[Full GC 1440K->1302K(244544K), 0.0071290 secs]
Message rate was 53.9 M/s offsets 64 128 144
Message rate was 54.3 M/s offsets 64 128 144
Message rate was 50.8 M/s offsets 64 128 144
Message rate was 56.6 M/s offsets 64 128 144
Message rate was 56.0 M/s offsets 64 128 144
Message rate was 53.6 M/s offsets 64 128 144










EDIT: Using @BegemoT's library to print the core id used, I get the following on a 3.8 GHz i7 (home PC).





Note: the offsets are too large by a factor of 8. Because the heap size was small, the JVM does not multiply the compressed reference by 8 as it does with a larger heap (one which is still under 32 GB).
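
In other words, with a small heap the compressed reference is stored unshifted, so the true relative offsets are the printed values divided by 8. A sketch of an adjusted helper, under my assumption of zero-based compressed oops with a shift of 0 (possible when the heap fits in the low 4 GB):

// Variant of addressOf() for a small heap where compressed oops use no shift.
public static long addressOfUnshifted(Object... o) {
    long offset = UNSAFE.arrayBaseOffset(o.getClass());
    return UNSAFE.getInt(o, offset) & 0xFFFFFFFFL; // raw 32-bit reference, no * 8
}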







writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 54.4 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#6]
Message rate was 54.2 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 3392 3904 4416

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.5 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.9 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 26.0 M/s offsets 1088 1600 2112

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.0 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.8 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 1088 1600 2112







You can see that the same logical threads are being used, but the performance varies between runs, though not within a run (within a run the same objects are used).








I have found the problem. It was a memory layout issue, but I could not see a simple way to resolve it: ByteBuffer cannot be extended, so you can't add padding to it, so instead I create an object which I discard.







final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
final ByteBuffer readBuffer = writeBuffer.slice();
new PaddedAtomicInteger(); // discarded; allocated only to shift where the counters land in memory.
final AtomicInteger readCount = new PaddedAtomicInteger();
final AtomicInteger writeCount = new PaddedAtomicInteger();







Without this extra padding (of the object which is not used), the results look like this on a 3.8 GHz i7.







Message rate was 38.5 M/s offsets 3392 3904 4416
Message rate was 54.7 M/s offsets 3392 3904 4416
Message rate was 59.4 M/s offsets 3392 3904 4416

Message rate was 54.3 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112

Message rate was 28.0 M/s offsets 1088 1600 2112
Message rate was 28.1 M/s offsets 1088 1600 2112
Message rate was 28.0 M/s offsets 1088 1600 2112

Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112

Message rate was 54.5 M/s offsets 1088 1600 2112
Message rate was 54.2 M/s offsets 1088 1600 2112
Message rate was 55.1 M/s offsets 1088 1600 2112

Message rate was 25.5 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112

Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 54.7 M/s offsets 1088 1600 2112
Message rate was 54.4 M/s offsets 1088 1600 2112

Message rate was 57.0 M/s offsets 1088 1600 2112
Message rate was 55.9 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112

Message rate was 51.4 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 56.1 M/s offsets 1088 1600 2112

Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 47.4 M/s offsets 1088 1600 2112







And with the discarded padded object:







Message rate was 54.3 M/s offsets 3392 4416 4928
Message rate was 53.1 M/s offsets 3392 4416 4928
Message rate was 59.2 M/s offsets 3392 4416 4928

Message rate was 58.8 M/s offsets 1088 2112 2624
Message rate was 58.9 M/s offsets 1088 2112 2624
Message rate was 59.3 M/s offsets 1088 2112 2624

Message rate was 59.4 M/s offsets 1088 2112 2624
Message rate was 59.0 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624

Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.2 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.6 M/s offsets 1088 2112 2624

Message rate was 59.6 M/s offsets 1088 2112 2624
Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 61.6 M/s offsets 1088 2112 2624
Message rate was 60.8 M/s offsets 1088 2112 2624

Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 58.3 M/s offsets 1088 2112 2624







Unfortunately there is always the risk that after a GC, the objects will not be laid out optimally. The only way to resolve this may be to add padding to the original class. :(
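
For what it is worth, here is a sketch (mine, not part of the post) of what padding built into a counter class can look like, so that the hot field is shielded on both sides no matter where the GC moves the object. It assumes the usual HotSpot layout where superclass fields are placed before subclass fields and roughly 64-byte cache lines, and it reuses the UNSAFE field from the Main class above. It does not help with the ByteBuffers themselves, which still cannot be extended.

// Padding before the hot field comes from the superclass, padding after from the subclass.
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

class Value extends LhsPadding {
    protected volatile int value;
}

class PaddedCounter extends Value {
    protected long q1, q2, q3, q4, q5, q6, q7;

    private static final long VALUE_OFFSET;

    static {
        try {
            VALUE_OFFSET = Main.UNSAFE.objectFieldOffset(Value.class.getDeclaredField("value"));
        } catch (NoSuchFieldException e) {
            throw new AssertionError(e);
        }
    }

    public int get() {
        return value;
    }

    public void set(int newValue) {
        value = newValue;
    }

    public void lazySet(int newValue) {
        Main.UNSAFE.putOrderedInt(this, VALUE_OFFSET, newValue);
    }
}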


Comments

  1. I'm not an expert in the area of processor caches but I suspect your issue is essentially a cache issue or some other memory layout problem. Repeated allocation of the buffers and counters without cleaning up the old objects may be causing you to periodically get a very bad cache layout, which may lead to your inconsistent performance.

    Using your code and making a few mods, I have been able to make the performance consistent (my test machine is an Intel Core 2 Quad Q6600 at 2.4 GHz with Win7 x64, so not quite the same but hopefully close enough to give relevant results). I've done this in two different ways, both of which have roughly the same effect.

    First, move creation of the buffers and counters outside of the doTest method so that they are created only once and then reused for each pass of the test. Now you get one allocation; it sits nicely in the cache and performance is consistent.

    Another way to get the same reuse, but with "different" buffers/counters, was to insert a GC after the performTiming loop:

    for (int i = 0; i < 3; i++)
        performTiming(writeBuffer, readBuffer, readCount, writeCount);
    System.out.println();
    System.gc();


    Here the result is more or less the same: the GC lets the buffers/counters be reclaimed, and the next allocation ends up reusing the same memory (at least on my test system), so you end up in cache with consistent performance (I also added printing of the actual addresses to verify reuse of the same locations). My guess is that without the clean-up leading to reuse, you eventually end up with a buffer allocated somewhere that doesn't sit well in the cache, and your performance suffers as it is swapped in. I suspect that you could do some strange things with the order of allocation (for example, you can make the performance worse on my machine by moving the counter allocation in front of the buffers) or create some dead space around each run to "purge" the cache, if you didn't want to eliminate the buffers from a prior loop.

    Finally, as I said, processor cache and the fun of memory layouts aren't my area of expertise so if the explanations are misleading or wrong - sorry about that.

  2. You are busy waiting. That is always a bad idea in user code.

    reader:

    while ((toRead = writeCount.get() - rc) <= 0) ;


    writer:

    while (wc - readCount.get() > 0) ;

  3. As a general approach to performance analysis:


    Try jconsole. Start your app, and while it's running type jconsole in a separate terminal window. This will bring up the Java Console GUI, which allows you to connect to a running JVM and see performance metrics, memory usage, thread count and status, etc.
    Basically, you're going to have to figure out the correlation between the speed fluctuations and what you see the JVM doing. It could also be helpful to bring up your task manager and see if your system is actually just busy doing other stuff (paging to disk due to low memory, busy with a heavy background task, etc.) and put it side by side with the jconsole window.
    One other alternative is launching the JVM with the -Xprof option, which outputs the relative time spent in various methods on a per-thread basis, e.g. java -Xprof [your class file].
    Finally, there is also JProfiler, but it's a commercial tool, if that matters to you.

  4. EDIT: It appears that triggering a GC will shift the behaviour. These show repeated tests on the same buffers and counters with a manually triggered GC halfway through.


    A GC means reaching a safepoint, which means all threads have stopped executing bytecode and the GC threads have work to do. This can have various side effects. For example, in the absence of any explicit CPU affinity, you may restart execution on a different core, or cache lines may have been refreshed. Can you track which cores your threads are running on?

    What CPUs are these? Have you done anything about power management to prevent them dropping down into lower P and/or C states? Perhaps one thread is being scheduled onto a core that was in a different P state and hence shows a different performance profile.

    EDIT

    I tried running your test on a workstation running x64 Linux with two slightly old quad-core Xeons (E5504); it's generally consistent within a run (~17-18 M/s), with occasional runs that are much slower, which appears to generally correspond with thread migrations. I didn't plot this rigorously. It therefore appears your problem might be CPU-architecture specific. You mention you're running an i7 at 4.6 GHz; is that a typo? I thought the i7 topped out at 3.5 GHz with a 3.9 GHz turbo mode (an earlier version went from 3.3 GHz to a 3.6 GHz turbo). Either way, are you sure you're not seeing an artifact of turbo mode kicking in and then dropping out? You could try repeating the test with turbo disabled to be sure.

    A couple of other points:

    The padding values are all 0; are you sure there isn't some special treatment being meted out to uninitialised values? You could consider using the LogCompilation option to understand how the JIT is treating that method.
    Intel VTune is free for a 30-day evaluation; if this is a cache-line problem then you could use it to determine what the problem is on your host.

  5. How do you actually pin your threads to cores? taskset is not the best way to pin threads to cores, since it only pins the whole process to a set of cores, and all of its threads will share those cores. Recall that Java has many internal threads for its own needs, so they will all contend for the cores you bind the process to.

    To get more consistent results you can use JNA to call sched_setaffinity() from only the threads you need to pin. It will bind only your benchmarking threads to specific cores, while the other Java threads spread over the remaining free cores and have less influence on your code's behaviour.

    By the way, I've had similar issues with unstable performance while benchmarking highly optimized concurrent code. It seems there are too many things which can influence performance drastically when it is close to the hardware limits. You should tune your OS to give your code the best chance, or just run many experiments and use statistics to get averages and confidence intervals.

  6. There would certainly be some inconsistency introduced when a full GC runs, but that's not so often. Try modifying the stack size (-Xss) to, say, 32 MB and see if that helps. Also, try clearing the two buffers at the end of each test to make it even easier for the GC to know that the contents can be collected. Interestingly, you have used Thread.stop(), which is deprecated and absolutely not recommended. I would suggest changing that too.

