The content-based page sharing mechanism described in the ESX paper transparently deduplicates identical pages regardless of which VM they belong to. The paper states that the algorithm scans randomly chosen pages while the CPU is idle, computes a hash of each page's contents, and looks it up in a global table. If the hash matches an already-shared page, the algorithm performs a full comparison, increments the reference count, and reclaims the page. Otherwise, it enters the hash into the table marked as a hint; once a later scan finds a duplicate, one page is marked copy-on-write and the other is reclaimed.
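The scan-hash-hint flow above could be sketched roughly as follows. This is an illustrative model, not the ESX code: the class and method names are made up, and for simplicity the table stores the page contents directly rather than comparing physical frames.

```python
import hashlib

class ShareTable:
    """Toy model of content-based page sharing (names are hypothetical)."""

    def __init__(self):
        # hash -> {"state": "hint" | "shared", "page": bytes, "refs": int}
        self.entries = {}

    def scan_page(self, page: bytes) -> str:
        h = hashlib.sha1(page).hexdigest()
        entry = self.entries.get(h)
        if entry is None:
            # First sighting: record the hash as a hint only.
            self.entries[h] = {"state": "hint", "page": page, "refs": 1}
            return "hint"
        # Hash matched: full byte comparison rules out hash collisions.
        if entry["page"] != page:
            return "collision"
        if entry["state"] == "hint":
            # A duplicate of a hinted page: share one frame copy-on-write.
            entry["state"] = "shared"
        entry["refs"] += 1
        return "shared"
```

A second scan of an identical page upgrades the hint to a shared entry and bumps the reference count, mirroring the two-step hint mechanism described in the paper.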
In the case of NUMA, I would adapt the mechanism as follows: the scanning threads should check pages in local memory (the memory on the same node). Inspired by generational GC, I would use per-node hash tables and perform local deduplication as a first step; this prevents the scanning thread from consuming inter-node bandwidth and from moving pages away to slower remote memory. Ideally, the scheduler would also be smart enough to run copies of the same VM on the same node as much as possible.
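A minimal sketch of the per-node-tables idea, under the assumption that each scanning thread only ever hashes pages resident on its own node (all names are illustrative):

```python
import hashlib

class NumaDedup:
    """One share table per NUMA node; sharing happens only within a node."""

    def __init__(self, num_nodes: int):
        self.tables = [{} for _ in range(num_nodes)]

    def scan(self, node: int, page: bytes) -> str:
        # The scanning thread on `node` only touches its local table,
        # so no inter-node bandwidth is used during the first pass.
        h = hashlib.sha1(page).hexdigest()
        table = self.tables[node]
        entry = table.get(h)
        if entry is not None and entry["page"] == page:
            entry["refs"] += 1
            return "shared"
        table[h] = {"page": page, "refs": 1}
        return "hint"
```

Note the consequence of this design: the same content on two different nodes is recorded as two independent hints and is not merged in this first step.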
The deduplicated pages would be tagged with a timestamp or access count so that we can distinguish hot pages from cold ones (startup code, for example). Then, if one node comes under heavy memory pressure, it could start checking whether its cold pages have duplicates on other nodes and switch to the remote copy, with some policy for duplicating the page back to the original node if it turns hot again.
Guest OSes also give us hints about page value: when the balloons inflate, a guest will first free its buffer cache pages and then start paging out to its virtual disk. While the hypervisor does not know exactly which virtual disk activity is caused by paging, it knows the timing, and hashing is cheap (possibly even done in hardware). If a page being paged out is deduplicated and cold (it might still be hot in other VMs on the same node), the hypervisor can check the other nodes for duplicates before writing it to disk. If some other node has low memory pressure, it might even be preferable to move all virtually paged-out pages there. While this certainly speeds up access once the guest pages them back in, we have to trade that off against extra complexity, inter-node bandwidth, and resources consumed on the other node. The scheduler is also responsible for migrating whole VMs, which must update the hash tables on both nodes (for example, by keeping separate local and remote reference counters).
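The local/remote counter idea mentioned above might look like this. The class is hypothetical; the point is only that a VM migration can move its references from one counter to the other without rescanning any pages:

```python
class SharedEntry:
    """A shared page's bookkeeping on its owning node (illustrative sketch)."""

    def __init__(self):
        self.local_refs = 0   # references from VMs running on the owning node
        self.remote_refs = 0  # references from VMs on other nodes

    def on_vm_migrate_away(self, refs: int) -> None:
        # A VM holding `refs` references leaves the owning node:
        # its references flip from local to remote.
        self.local_refs -= refs
        self.remote_refs += refs

    def on_vm_migrate_home(self, refs: int) -> None:
        self.remote_refs -= refs
        self.local_refs += refs
```

When `local_refs` drops to zero, the node could consider handing the page's frame over to whichever node still references it.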
In conclusion, NUMA modifies the memory hierarchy by layering RAM into classes of different distance. While the bandwidth and latency differences are within one order of magnitude per hop, overall performance can still suffer from overly eager deduplication. On the other hand, memory is usually a precious resource: if a page is deduplicated away to memory that is 20x slower, but the freed page can be used to avoid disk accesses that are five orders of magnitude slower, we still gain a lot.
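A back-of-envelope check of that last trade-off, with illustrative numbers (assume 100 ns for a local RAM access, 20x that for remote memory, and five orders of magnitude over RAM for a disk access):

```python
local_ns = 100
remote_ns = 20 * local_ns        # 2,000 ns per remote access
disk_ns = 10**5 * local_ns       # 10,000,000 ns per disk access

# Deduplicating a page to remote memory costs this much extra per access:
penalty = remote_ns - local_ns   # 1,900 ns
# The freed frame can cache a block that would otherwise hit disk:
saving = disk_ns - local_ns      # ~10 ms per avoided disk access
assert saving / penalty > 5000   # the saving dominates by a factor of ~5,000
```

So even a fairly pessimistic remote-access penalty is dwarfed by a single avoided disk access, which is why eager deduplication can pay off despite NUMA distance.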