As we discussed in the lectures, the upcoming Intel processors will ship with hardware TM. Writing efficient transactions for the hardware TM is highly dependent on the L1 cache behavior. I will describe two problems regarding the caches and a couple of ways to mitigate the unwanted behaviors.
The first problem is false aborts. The smallest accessible cache unit is a block (a.k.a. cache line). The block size of the L1 cache is 64B and it is optimized for getting the maximum benefits from the locality in accesses. This also means that the smallest chunk of data that can be tagged with read/write set is also 64B. There are cases where the entries of the same data structure reside in the same cache block (e.g., two 32B entries). The problem arises when one of the entries in the cache block is speculatively modified but the other is not. All the accesses that go to the clean entry will trigger a false abort. To solve this problem, the Java compiler should pad the synchronized data structures so that each entry will span the whole cache block. Obviously, some of the cache will be wasted but the cost of false abort is much higher than this.
The second problem is cache set contention. The processor’s L1 caches are 4-way set associative caches. A 32KB cache with 64B blocks can hold 512 blocks but the blocks are not placed contiguously, they are distributed based on the lower bits of their address. For example, the 4-way set associative cache distributes 512 blocks to 128 sets (512/4), and each set holds four blocks inside. As a result the cache starts evicting the blocks not when the cache is full (512 lines) but the set is full (4 lines). A new block brought into the set will kick out one of the four blocks and the transaction will abort if the evicted block is speculative. A solution to this problem would be reorganizing
the address layout of the objects in the Java heap. Luckily, Intel processor caches are indexed with virtual addresses and therefore the data structures’ set mappings can be determined by the JVM (by getting cache information from OS). A new-allocated and non-synchronized object can be mapped to limited number of sets and therefore can help reducing the pressure on other sets. The caveat here is that the object should be small, if it is not, it should be mapped non-contiguously in the JVM heap.
For the set contention problem, another solution would be managing the L1 capacity by distributing the threads employing synchronized structures over multiple cores to minimize the set contention and capacity pressure. It is important to note that binding the threads on physical cores cannot be done with JVM only and requires JNI functions (taskset etc.).
To conclude, false aborts and set contention can limit the system to make the most of TM’s benefits. Cache-aware object placement, allocation and capacity management can help us to push TM to its limits.