Flexihash - Consistent Hashing for PHP
— Paul Annesley, April 2008
After reading Tom Kleinpeter's article on consistent hashing last month, I was inspired to write Flexihash, an open source consistent hashing implementation for PHP, as I couldn't see anything decent around that fit the bill.
But first, a little bit of background and a use case example. C'mon, humor me…
There's a few reasons why a large website would want to spread their images (and other resources) out across several URLs. One reason is that web browsers limit the amount of concurrent connections per hostname, and the speed benefits of greater concurrency can be achieved by using multiple hostnames throughout a page. Another reason is so that the requests can be spread across multiple servers without funneling though a single load balancer.
To spread the images across multiple URLs, there needs to be a way to map a given URL path to one of the available servers. Once downloaded, the image will be cached in users browsers based on its URL, so every effort must be made to serve it from the same hostname in future. This rules out picking one of the servers at random each time.
A hash function consistently maps given data (anything, but in this case an image path) to a relatively small integer. Hash functions are consistent by definition, but don't confuse them with consistent hashing - read on!
Using a hash function (or a hash-like function such as CRC32), it would seem the goal is easily achievable. Hash the image path to an integer, modulus it by the number of servers, and look up the server at that index.
servers = [s1, s2, s3] servers[HASH(/path/to/image.png) % servers.length] # => s2
The trouble with hashing is that it turns into thrashing the moment the list of servers changes. If another server is added to handle increased load, it'll certainly get a good workout, as suddenly most of the image URLs are hashing to a different hostname, missing the users browser cache, missing the front proxy cache and pummeling the servers. If a server is removed, even worse, as there'll be less cannon fodder for flood of HTTP.
Onto the good stuff…
As with most problems in life, the logical solution starts with drawing a ring to represent your hash address space as a continuum. I solve just about everything that way. Randomly placing a dot somewhere on the enormous ring for each of a handful of servers will likely result in uneven distribution, which can be fixed by adding another hundred or so randomly placed dots for each server.
The whole idea sounds pretty strange, until you realise that you can now hash any of the image paths to a point on that ring, and then moving clockwise, pick the closest server. If a server is added, it wont get in the way of many lookups, and if one is removed it wont be sorely missed.
paulbookpro:flexihash paul$ php tests/runtests.php --with-benchmark All Tests NonConsistentHash: 92% of lookups changed after adding a target to the existing 10 NonConsistentHash: 90% of lookups changed after removing 1 of 10 targets ConsistentHash: 6% of lookups changed after adding a target to the existing 10 ConsistentHash: 9% of lookups changed after removing 1 of 10 targets Distribution of 100 lookups per target (min/max/median/avg): 71/126/109/100 …
Of course, consistent hashing has far more useful applications than distributing URLs. In applications like server-side distributed caching (e.g. memcached), consistent hashing has another great property - redundant writes. When picking a server to write a cache item to, rather than stopping at the closest server on the ring, continuing to move clockwise will find the second best server for that item. Writing the cache item to the first and second best server means removing the first one causes little to no cache misses.
<?php // write cache item to both targets in case cache-2 is removed $hash->lookupList('object', 2); // [cache-2, cache-4] $hash->removeTarget('cache-2'); $hash->lookup('object'); // cache-4
Check It Out
Update Feb 2009: ignore the following paragraph, find Flexihash at GitHub!
You can grab Flexihash from Google code, either as a source archive, a single bundled PHP file, or via Subversion (I'm not cool enough to be using Git just yet). There's also some pretty graphs, stats and information at Flexihash on Ohloh which suggest Flexihash is crap. Don't believe the lies!
<?php require_once('flexihash.php'); $hash = new Flexihash(); // bulk add $hash->addTargets(array('cache-1', 'cache-2', 'cache-3')); // simple lookup $hash->lookup('object-a'); // cache-1 $hash->lookup('object-b'); // cache-2 // add and remove $hash ->addTarget('cache-4') ->removeTarget('cache-1'); // lookup with next-best fallback (for redundant writes) $hash->lookupList('object', 2); // [cache-2, cache-4] // remove cache-2, expect object to hash to cache-4 $hash->removeTarget('cache-2'); $hash->lookup('object'); // cache-4