With memory deduplication on a NUMA machine, the tradeoff is between the latency and bandwidth of data accesses and the memory wasted on duplicates. From a performance perspective, an ideal solution would keep a copy of the data on every memory node that uses it. However, suppose a large duplicated structure is used by multiple VMs but is only rarely accessed. In that situation, the performance gain is too small to justify the memory spent on multiple copies.
The proposed solution starts from a single deduplicated copy of the data, as in the ESX VMM; additional copies are created at the appropriate nodes when necessary and removed when they are no longer required.
We first discuss duplication. The proposal uses structures similar to those of ESX (the hash table), with small changes. A simple solution is to duplicate a page on the local node when a TLB hit occurs for a page stored on a remote memory node, since a hit implies that the page was recently accessed multiple times. The advantage of this solution is that it requires no extra accounting from the VMM. However, if more accuracy is desired, one could keep a per-page counter for each node, reflecting the number of times cores on that node requested the page and did not find it locally. Each increment should also be proportional to the number of hops between the requesting node and the node holding the page. When this counter reaches a threshold for a particular node, the content of the page may be copied to that node. After duplication, the mapping from physical addresses to machine addresses for that page has to be updated for the VMs that are now closer to the new copy than to any other.
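The counter-based variant can be illustrated with a minimal sketch. It is not ESX code: the class RemoteAccessTracker, its methods, the hop matrix and the threshold value are all hypothetical names and numbers chosen for illustration, and the remapping of physical to machine addresses after duplication is left out.

```python
from collections import defaultdict


class RemoteAccessTracker:
    """Hypothetical per-page, per-node accounting of hop-weighted remote misses."""

    def __init__(self, hop_distance, threshold):
        # hop_distance[a][b]: number of hops between nodes a and b (assumed known).
        self.hop_distance = hop_distance
        self.threshold = threshold
        # page -> set of nodes that currently hold a copy of the page.
        self.copies = {}
        # (page, node) -> accumulated hop-weighted miss score.
        self.score = defaultdict(int)

    def add_page(self, page, home_node):
        # Start from a single deduplicated copy on one node.
        self.copies[page] = {home_node}

    def nearest_copy(self, page, node):
        # Closest node holding the page; ties are broken by node number.
        return min(self.copies[page],
                   key=lambda n: (self.hop_distance[node][n], n))

    def record_access(self, page, node):
        """Account one access; return True if it triggers a copy onto `node`."""
        if node in self.copies[page]:
            return False  # page is already local, nothing to account
        source = self.nearest_copy(page, node)
        # Weight the miss by the distance to the copy that served it.
        self.score[(page, node)] += self.hop_distance[node][source]
        if self.score[(page, node)] >= self.threshold:
            self.copies[page].add(node)
            del self.score[(page, node)]
            return True
        return False


# Three nodes in a line: 0 - 1 - 2.
hops = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
tracker = RemoteAccessTracker(hops, threshold=4)
tracker.add_page("p", home_node=0)
results = [tracker.record_access("p", 2) for _ in range(3)]
print(results)  # [False, True, False]
```

In this sketch, node 2 is two hops from the original copy, so its second miss reaches the threshold of 4 and creates a local copy; a node one hop away would need four misses. Raising the threshold makes duplication rarer, and lowering it makes the mechanism more aggressive.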
Regarding excess duplication, in my opinion the mechanisms already present in ESX Server are a good way of removing rarely used copies of data. Data that is not important to a VM will naturally be swapped out as the balloon size is increased; moreover, a copy that is accessed only seldom will contribute to the idle memory tax. Therefore, in this proposal, once a duplicate of a page is created, it is never merged back with the original. However, if a page is swapped out and then swapped back in, it is initially assumed that deduplication is acceptable for that page.
Another point is that applications may explicitly set the memory node on which they would like their allocations to take place (e.g., with the numa_set_preferred() function from libnuma). In this case, the VMM could take such explicit requests into account and, when possible, allocate the VM's data on the requested node, even if this reduces the effect of deduplication.
To conclude, it is important to note that the usefulness of such techniques depends heavily on the architecture. In the proposed solution, the duplication threshold can be adjusted to control how invasive these mechanisms are.