Is volatile expensive?

After reading http://gee.cs.oswego.edu/dl/jmm/cookbook.html about the implementation of volatile, especially the section "Interactions with Atomic Instructions", I assume that reading a volatile variable without updating it needs a LoadLoad or a LoadStore barrier. Further down the page I see that LoadLoad and LoadStore are effectively no-ops on x86 CPUs. Does this mean that volatile read operations can be done without an explicit cache invalidation on x86, and are as fast as a normal variable read (disregarding the reordering constraints of volatile)?

I believe I don't understand this correctly. Could someone care to enlighten me?

EDIT: I wonder if there are differences in multi-processor environments. On single-CPU systems the CPU might look at its own thread caches, as John V. states, but on multi-CPU systems there must be some configuration option telling the CPUs that this is not enough and that main memory has to be hit, making volatile slower on multi-CPU systems, right?

PS: While learning more about this I stumbled upon the following great articles, and since this question may be interesting to others, I'll share my links here:

Java theory and practice: Fixing the Java Memory Model, Part 1 and

Java theory and practice: Fixing the Java Memory Model, Part 2

Comments

Tips For All, March 9, 2012 at 6:37 AM
In the words of the Java Memory Model (as defined for Java 5+ in JSR 133), a write to a volatile variable happens-before every subsequent read of that same variable. This means that the compiler and JIT are forced to avoid certain optimisations, such as reordering instructions around the access within the thread or keeping the value only in a register or local cache.

Since some optimisations are not available, the resulting code is necessarily slower than it would have been, though probably not by very much.

Nevertheless you shouldn't make a variable volatile unless you know that it will be accessed from multiple threads outside of synchronized blocks. Even then you should consider whether volatile is the best choice versus synchronized, AtomicReference and its friends, the explicit Lock classes, etc.
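To make the happens-before point concrete, here is a minimal sketch, assuming a single writer thread and a single reader. The class, field and method names are hypothetical and chosen only for illustration. The plain field is written before the volatile flag, so a reader that sees the flag set is also guaranteed to see the payload.

```java
// Hypothetical names throughout; a sketch of publishing data through a volatile flag.
class VolatilePublishSketch {
    private int payload;             // plain field, written before the flag
    private volatile boolean ready;  // volatile flag that publishes the payload

    void publish(int value) {
        payload = value;  // ordinary write
        ready = true;     // volatile write: happens-before any read that observes true
    }

    int awaitPayload() {
        // Volatile read: the JIT may not hoist it out of the loop.
        while (!ready) {
            Thread.yield();
        }
        return payload;   // guaranteed to see the value written before ready = true
    }

    // Runs one writer thread and reads the payload on the calling thread.
    static int demo(int value) {
        VolatilePublishSketch box = new VolatilePublishSketch();
        Thread writer = new Thread(() -> box.publish(value));
        writer.start();
        int seen = box.awaitPayload();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo(42)); // prints 42
    }
}
```

If ready were not volatile, the reader could legally spin forever or observe a stale payload, which is the kind of optimisation the volatile keyword rules out.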

Tips For All, March 9, 2012 at 6:37 AM
Accessing a volatile variable is in many ways similar to wrapping access to an ordinary variable in a synchronized block, at least as far as visibility and ordering go (volatile provides no mutual exclusion). For instance, access to a volatile variable prevents the compiler and CPU from reordering the instructions before and after the access, and this generally slows down execution (though I can't say by how much).

More generally, on a multi-processor system I don't see how access to a volatile variable can be done without penalty -- there must be some way to ensure a write on processor A will be synchronized to a read on processor B.
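One caveat to the comparison with synchronized in this comment: volatile gives visibility and ordering but not atomicity for compound operations. The following is a rough sketch, with hypothetical class and method names, that increments a volatile counter and a synchronized counter from several threads. The synchronized count is always exact; the volatile count may lose updates, and by how much depends on the machine and the run.

```java
// Hypothetical names; a sketch contrasting volatile with synchronized for increments.
class VolatileVsSynchronizedSketch {
    private volatile int volatileCount; // visible to all threads, but ++ is not atomic
    private int lockedCount;            // guarded by the object's monitor

    void bumpVolatile() {
        volatileCount++;                // read, add, write: three steps that can interleave
    }

    synchronized void bumpLocked() {
        lockedCount++;                  // mutual exclusion makes the whole step atomic
    }

    synchronized int lockedCount() {
        return lockedCount;
    }

    int volatileCount() {
        return volatileCount;
    }

    // Returns { volatile total, synchronized total } after all workers finish.
    static int[] run(int threads, int perThread) {
        VolatileVsSynchronizedSketch c = new VolatileVsSynchronizedSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    c.bumpVolatile();
                    c.bumpLocked();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new int[] { c.volatileCount(), c.lockedCount() };
    }

    public static void main(String[] args) {
        int[] totals = run(4, 100_000);
        System.out.println("volatile:     " + totals[0]); // at most 400000, often less
        System.out.println("synchronized: " + totals[1]); // always 400000
    }
}
```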

Tips For All, March 9, 2012 at 6:37 AM
Generally speaking, on most modern processors a volatile load is comparable to a normal load. A volatile store takes about 1/3 of the time of a monitor-enter/monitor-exit pair. This is seen on systems that are cache coherent.

To answer the OP's question, volatile writes are expensive while the reads usually are not.

"Does this mean that volatile read operations can be done without an explicit cache invalidation on x86, and are as fast as a normal variable read (disregarding the reordering constraints of volatile)?"

Yes, sometimes when validating a field the CPU may not even hit main memory, and can instead snoop on the other CPUs' caches and get the value from there (a very general explanation).

However, I second Neil's suggestion that if you have a field accessed by multiple threads you should wrap it in an AtomicReference. An AtomicReference gives roughly the same throughput for reads and writes, but it also makes it more obvious that the field will be accessed and modified by multiple threads.
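As a minimal sketch of that suggestion, assuming a field that holds a small state value (the class, enum and method names here are hypothetical), the wrapper could look like this. get() and set() have the memory effects of a volatile read and write, and compareAndSet adds an atomic conditional update that a plain volatile field does not offer.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names; a sketch of a shared field wrapped in an AtomicReference.
class AtomicReferenceSketch {
    enum State { INIT, RUNNING, STOPPED }

    // The type itself signals that this field is shared between threads.
    private final AtomicReference<State> state = new AtomicReference<>(State.INIT);

    State current() {
        return state.get();                          // same memory effects as a volatile read
    }

    void force(State next) {
        state.set(next);                             // same memory effects as a volatile write
    }

    boolean transition(State expected, State next) {
        // Atomic compare-and-set; compares by reference identity, which is safe for enums.
        return state.compareAndSet(expected, next);
    }

    public static void main(String[] args) {
        AtomicReferenceSketch s = new AtomicReferenceSketch();
        System.out.println(s.transition(State.INIT, State.RUNNING)); // prints true
        System.out.println(s.transition(State.INIT, State.STOPPED)); // prints false
        System.out.println(s.current());                             // prints RUNNING
    }
}
```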

Edit to answer OP's edit:

Cache coherence is a bit of a complicated protocol, but in short: CPUs share a common cache line that is attached to main memory. If a CPU loads memory and no other CPU has it, that CPU will assume it holds the most up-to-date value. If another CPU then tries to load the same memory location, the CPU that already loaded it will be aware of this and will share the cached value with the requesting CPU. Now the requesting CPU has a copy of that memory in its own cache. (It never had to look in main memory for the value.)

There is quite a bit more to the protocol, but this gives an idea of what is going on. Also, to answer your other question: in the absence of multiple processors, volatile reads/writes can in fact be faster than with multiple processors. There are some applications that would in fact run faster concurrently on a single CPU than on multiple.