Saturday, March 29, 2014

A Pauseless HashMap

HashMaps are great. And fast. Well, fast most of the time. If you keep growing them, you'll get elevator music every once in a while. Then they go fast again. For a while.

Wouldn't it be nice if HashMaps didn't stall your code even when they were resizing?

Some background: As those of you who have read my various past rants may have noticed, I spend a lot of my time thinking about the behavior of {latency, response-time, reaction-time}. In addition to trying to better understand or teach about the behavior (with monitoring and measurement tools like HdrHistogram, LatencyUtils, and jHiccup), I actually work on things that try to improve bad behavior. For some definitions of "improve" and "bad". Eliminating pausing behavior in GC was the lowest hanging fruit, but more recent work has focused on eliminating pauses due to other things that stand out once those pesky GC blips are gone. Things like at-market-open deoptimzations. Things like lock deflation, lock de-biasing, class unloading, weak reference processing, and all sorts of TTSP (time to safepoint) issues. I've also learned a lot about how to bring down Linux's contribution to latency spikes.

But the JVM and the OS are not the only things that cause latency spikes. Sometimes it's your code, and the code is doing something "spiky". In my day job, I keep running into in actual, real-world low latency system code that is typically super-fast, but occasionally spikes in actual work latency due to some rare but huge piece of work that needs to be done. This is most often associated with some state accumulation. Once we eliminate GC pauses (which tend to dominate latency spikes, but also tend to simply disappear when Zing is applied), we get to see the things that were hiding in the GC noise. We often run into "nice" patterns of growing latency spikes at growing intervals, with a near-perfect doubling in both magnitude and interval between the spikes. This happens so often that we've studied the common causes, and (by far) the most common culprits seem to be HashMaps. The kind used to accumulate something during the day, and which resize in powers-of-two steps as a result.

I've had "build a Pauseless HashMap" on my weekend project list for over a year now, but finally got around to actually building it (at the request of a friend on the mechanical sympathy group). There are probably at least 17 ways to skin a HashMap so it won't stall puts and gets when it resizes, but this is my simple take on it: 


Keep in mind that (so far) this is a "probably-working draft" that's gone through some bench testing, but is not yet battle hardened (scrutiny is welcome).

I intentionally based this version on Apache Harmony's version of HashMap, and not on OpenJDK's, in order to make it available without GPLv2 license restrictions (for those who may want to include it in non-GPL products). The background resizing concept itself is simple, and can be applied just as easily to the OpenJDK version (e.g. if some future Java SE version wants to use it). You can use (https://svn.apache.org/repos/asf/harmony/enhanced/java/trunk/classlib/modules/luni/src/main/java/java/util/HashMap.java) as a baseline comparison for the code I started with.

This is also a classic example of how GC makes this sort of concurrent programming thing both fast and simple. This is a classic case of an asymmetric speed need between two concurrent actors that share mutable state. I worked hard to make the fast path get() and put() cheap, and managed (I think) to not even use volatiles in the fast path. In doing this, I shifted all the cost I could think of to the background work, where latency doesn't matter nearly as much. This sort of trick would be much harder (or slower) to do if GC wasn't there to safely pick up the junk behind us, as it would (invariably, I think) add a need for additional synchronizing work in the fast path.

7 comments:

  1. Hi Gil,
    usually one tries to avoid resizes by statically "guessing" the "right" size :-). Async resizing can be of particular interest when having very large tables >5 million entries. Many implementations out there show up resizing pauses up to several seconds in these cases.
    Would be interesting what cost (if any) there exists compared to the original "blocking-resize" version (for the non-resizing case). Did you do any comparision ?

    ReplyDelete
    Replies
    1. Rüdiger, obviously getting the size "right" from the start (so that no resizing is needed, ever) is best. But it's often not practical, and more importantly (based on experience) often not done.

      As for the cost difference compared to the original "blocking-resize" version, I haven't seriously benchmarked these yet, but code analysis is pretty simple in this case: If you look at the code for get() and put(), and compare them to the original Harmony implementation (see link in post), you'll see the following for the fast path cost deltas are tiny when no resizing is in flight:

      get():
      - For a hitting get(), there is zero extra cost.
      - For a missing get(), there is an extra (well predicted) branch when no resizing is going on.

      put():
      - There are two extra (well predicated) branches.

      When resizing is ongoing, the costs roughly double for both get and put, but don't grow past that. This is obviously a great result compared to stalling...

      Delete
    2. Thanks for explanation, i actually had only a quick glance at the source ..
      In depth discussion of this can be found here:
      https://groups.google.com/forum/?fromgroups&hl=de#!topic/mechanical-sympathy/DY8vysxdmj4

      Delete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Why did not use one of the many published extendible hashing (or dynamic hashing) systems, I recalled them from my Comp Sci degree 20 years ago, but yet have not seen them used in “modem” systems.
    http://arl.wustl.edu/~sailesh/download_files/Limited_Edition/hash/Dynamic%20Hash%20Tables.pdf is one of many papers that can be used as the start of a research search.

    ReplyDelete
  4. Because I wanted lookup times similar to HashTable. The various extensible schemes come with significantly higher get() costs.

    ReplyDelete
  5. thanks you for sharing
    i like this page
    Thanks for explanation, i actually had only a quick glance at the source ..
    In depth discussion of this can be found here
    gclub casino
    golden slot mobile
    gclub

    ReplyDelete