False Sharing — The lesser known performance killer

In this article we will look at the concept of false sharing and how it can hamper your application's performance. We will also explore some related concepts such as cache coherence, cache lines and Java's @Contended annotation for preventing false sharing.

What is False sharing?

Let us look at how Wikipedia defines this concept:

False sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but those data share a cache block with data that are altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource.

Let us look at some of the related concepts before diving deep into false sharing.

Hardware Caching and Cache Line

We all know that reading from and writing to a machine's main memory directly is a slow process, though much faster than reading from a hard disk. To compensate for this slow memory access, most processors today use caching to improve performance.

Machines these days use multiple levels of caching, referred to as L1, L2, L3 and, on some systems, L4. L1 is the fastest but also the most expensive, so machines tend to have a small L1 cache. L2, on the other hand, is slower than L1 but less expensive, so machines tend to have a larger L2 cache.

When data is read from memory, the requested data as well as the data around it is loaded into the caches, and the program is then served from the caches. The block of data that is loaded together in this way is referred to as a cache line. A cache line is formally defined as the unit of data transfer between the cache and main memory. Loading a whole cache line rather than individual bytes can dramatically improve application performance. On my laptop the cache line size for both L1 and L2 is 64 bytes. Since applications frequently read bytes from memory sequentially, loading a run of neighbouring data as one cache line increases the chance that the data needed next is already in the cache, so the application avoids hitting main memory on every request.
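To make the cache line idea concrete, here is a small sketch of the arithmetic that decides which line a byte falls into. It assumes the 64-byte line size quoted above; the class and method names are made up for illustration.

```java
final class CacheLineMath {
    // Assumed line size in bytes, matching the 64-byte value quoted above.
    static final int CACHE_LINE_BYTES = 64;

    // Index of the cache line that contains the given byte offset.
    static long lineIndex(long byteOffset) {
        return byteOffset / CACHE_LINE_BYTES;
    }

    // True when two byte offsets fall inside the same cache line.
    static boolean sameLine(long a, long b) {
        return lineIndex(a) == lineIndex(b);
    }

    public static void main(String[] args) {
        // Two adjacent 8-byte longs at offsets 0 and 8 share line 0.
        System.out.println(sameLine(0, 8));                 // true
        // Offsets 56 and 64 sit on either side of a line boundary.
        System.out.println(sameLine(56, 64));               // false
        // One line has room for 64 / 8 = 8 long values.
        System.out.println(CACHE_LINE_BYTES / Long.BYTES);  // 8
    }
}
```

So two long values that are stored next to each other will usually travel between memory and cache together.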

You can check your laptop's hardware cache details by running the command sysctl -a | grep cache (on macOS). My system (MacBook Pro 2015) has an L1 cache (L1I and L1D) of 32KB, an L2 cache of 256KB and an L3 cache of 3MB.

Cache Coherency

In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of shared data: one copy in the main memory and one in the local cache of each processor that requested it. When one of the copies of data is changed, the other copies must reflect that change. Cache coherence is the mechanism which ensures that the changes in the shared data are propagated throughout the system in a timely fashion. This ensures that the data in each cache stays in sync with main memory and with the other caches.

src: https://en.wikipedia.org/wiki/Cache_coherence

The MESI protocol is a cache coherence protocol, and is one of the most common protocols. In the MESI protocol, each cache line can be in one of these four distinct states: Modified, Exclusive, Shared, or Invalid.

Let's look at this protocol further through an example.

  1. Two cores, core X and core Y, try to read long values x and y from main memory. Let us assume that x and y are close to each other and lie in the same cache line.
  2. Core X reads the value of x from main memory. As seen before, the core fetches the neighbouring values as well and stores them all as one cache line. It then marks that cache line as exclusive, since core X is the only core holding this cache line. From now on, whenever possible, this core reads the value from the cache line instead of doing a less efficient read from main memory.
  3. Now let's say core Y also decides to read the value of y from main memory. Since y is in the same cache line as x, both cores tag their cache lines as shared.
  4. Let's say that core X now decides to modify the value of x. It modifies its local copy and changes the state of its cache line to modified.
  5. Core X notifies core Y of the change, and core Y marks its copy of the cache line as invalid. This keeps cores X and Y coherent.
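The steps above can be sketched as a toy state machine. This is only an illustrative model that tracks one cache line across two cores; it is not how real hardware is implemented, and the class and method names are invented for this example.

```java
import java.util.Arrays;

final class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // State of the single tracked cache line in each core's private cache.
    private final State[] states;

    MesiSketch(int cores) {
        states = new State[cores];
        Arrays.fill(states, State.INVALID);
    }

    // A core reads any value that lives on the tracked line.
    void read(int core) {
        if (states[core] != State.INVALID) {
            return; // cache hit, no state change
        }
        boolean othersHoldLine = false;
        for (int i = 0; i < states.length; i++) {
            if (i != core && states[i] != State.INVALID) {
                states[i] = State.SHARED; // existing holders drop to Shared
                othersHoldLine = true;
            }
        }
        states[core] = othersHoldLine ? State.SHARED : State.EXCLUSIVE;
    }

    // A core writes any value that lives on the tracked line.
    void write(int core) {
        for (int i = 0; i < states.length; i++) {
            if (i != core) {
                states[i] = State.INVALID; // every other copy is invalidated
            }
        }
        states[core] = State.MODIFIED;
    }

    State stateOf(int core) {
        return states[core];
    }

    public static void main(String[] args) {
        MesiSketch line = new MesiSketch(2); // core 0 is X, core 1 is Y
        line.read(0);  // step 2: X reads x
        System.out.println("X=" + line.stateOf(0) + ", Y=" + line.stateOf(1)); // X=EXCLUSIVE, Y=INVALID
        line.read(1);  // step 3: Y reads y on the same line
        System.out.println("X=" + line.stateOf(0) + ", Y=" + line.stateOf(1)); // X=SHARED, Y=SHARED
        line.write(0); // steps 4 and 5: X modifies x, Y's copy is invalidated
        System.out.println("X=" + line.stateOf(0) + ", Y=" + line.stateOf(1)); // X=MODIFIED, Y=INVALID
    }
}
```

Note that the model works on whole lines: it never looks at which variable inside the line was read or written.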

False Sharing

Now, let's come back to our topic of false sharing. Let us use the example above to see how false sharing can occur.

  1. Just to recap, the cache line of core X was in the modified state, whereas the cache line of core Y was in the invalid state.
  2. Now, suppose core Y wants to read the value of y again. Since its cache line was invalidated, it can't read the value from its cache and has to do the inefficient read from main memory (a cache miss).
  3. This forces core X to flush its store buffer. Both cores then hold the updated cache line, marked as shared.
    You might wonder what a store buffer is. Processors usually buffer the modifications they make in a store buffer before flushing them back to main memory. The buffer takes a bunch of small writes (think 8-byte writes) and packs them into a single larger transaction (a 64-byte cache line) before sending them to the memory system. Buffering and flushing back in batches can be a huge performance boost.
  4. This phenomenon, where a cache miss occurs even though the two values reside in different memory locations and the value being read was never updated, is called false sharing. It imposes a cache miss on one core and an early buffer flush on the other, even though the two cores weren't operating on the same memory location.

By increasing the number of cache misses and forcing much more frequent access to main memory, false sharing negatively affects the performance of the system.
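A minimal sketch of code that is prone to this pattern is shown below. Two threads each update only their own counter, yet the two fields are declared next to each other and will most likely share a cache line. The class names and the iteration count are arbitrary choices for illustration, and the measured time will vary from machine to machine.

```java
final class FalseSharingDemo {
    // x and y are adjacent fields, so they will most likely sit on the same
    // cache line (the JVM decides the actual layout, so this is an assumption).
    static final class Counters {
        volatile long x;
        volatile long y;
    }

    // Runs two threads that each increment only their own counter and
    // returns the elapsed time in nanoseconds.
    static long run(Counters c, long iterations) {
        // Each field has a single writer, so the non-atomic ++ is safe here.
        Thread tx = new Thread(() -> { for (long i = 0; i < iterations; i++) c.x++; });
        Thread ty = new Thread(() -> { for (long i = 0; i < iterations; i++) c.y++; });
        long start = System.nanoTime();
        tx.start();
        ty.start();
        try {
            tx.join();
            ty.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Counters c = new Counters();
        long nanos = run(c, 10_000_000L);
        System.out.println("x=" + c.x + ", y=" + c.y + ", took " + nanos / 1_000_000 + " ms");
    }
}
```

Logically the threads share nothing, but every write by one thread can invalidate the line held by the other.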

Avoiding false sharing

Now that we know what false sharing is and how it occurs, let's look at how we can avoid it. We will also look at the support Java provides to tackle this problem.

As seen in our previous example, the whole issue occurred because the two values x and y lie in the same cache line. A simple fix is to add padding between the two values, so that they reside in different cache lines.

Our cache line size is 64 bytes, and x and y are both long variables of 8 bytes each. So by adding 7 more long variables as padding between them, we can make sure that y ends up in a different cache line from x.

Declaring the padding variables as volatile reduces the risk of the JVM removing them as unused. Dead code elimination is an optimisation performed by the JVM that removes code which does not affect the program's results. Another way to escape dead code elimination is to log the padding values.
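The padding idea can be sketched as follows. This is an illustrative layout under the 64-byte assumption, not a guaranteed one: the JVM is free to reorder fields, and the field names p1 to p7 as well as the class names are invented for this example.

```java
final class PaddedCountersDemo {
    static final class PaddedCounters {
        volatile long x;
        // Seven 8-byte padding fields. Together with x they cover 64 bytes,
        // so y should start a full cache line after x if the declaration
        // order is preserved (likely, but not guaranteed by the JVM).
        volatile long p1, p2, p3, p4, p5, p6, p7;
        volatile long y;
    }

    // Same workload as before: each thread increments only its own counter.
    // Returns the elapsed time in nanoseconds.
    static long run(PaddedCounters c, long iterations) {
        Thread tx = new Thread(() -> { for (long i = 0; i < iterations; i++) c.x++; });
        Thread ty = new Thread(() -> { for (long i = 0; i < iterations; i++) c.y++; });
        long start = System.nanoTime();
        tx.start();
        ty.start();
        try {
            tx.join();
            ty.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        PaddedCounters c = new PaddedCounters();
        long nanos = run(c, 10_000_000L);
        System.out.println("x=" + c.x + ", y=" + c.y + ", took " + nanos / 1_000_000 + " ms");
    }
}
```

The trade-off is memory: each padded object carries 56 extra bytes that are never read for their value.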

@Contended Annotation (Java specific)

Java provides the @Contended annotation to deal with false sharing. The JDK source describes this annotation as follows:

An annotation expressing that objects and/or their fields are expected to encounter memory contention, generally in the form of “false sharing”. This annotation serves as a hint that such objects and fields should reside in locations isolated from those of other objects or fields. Susceptibility to memory contention is a property of the intended usages of objects and fields, not their types or qualifiers. The effects of this annotation will nearly always add significant space overhead to objects.

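In Java 8, @Contended lives in the sun.misc package (from Java 9 onward it is jdk.internal.vm.annotation.Contended), which means it is an internal API that we should ideally not use in our own code. By default the JVM honours it only for JDK classes; for application classes it takes effect only when the JVM is started with -XX:-RestrictContended.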

Some examples of core Java code (Java 8) where @Contended is used:

  1. ForkJoinPool class
  2. ThreadLocalRandom

That's all from my side, folks. I hope this article made sense to you. Please feel free to give your feedback. Check out my other articles on Medium at https://pratyushbansal.medium.com/
