I think the key thing to realize here is that when PHP says "array", they apparently mean "ordered map": if you are allowed to have 'foo' as a key, it is somewhat bothersome to call it an "array". Then, once you realize that it is actually a hashtable, it is fairly obvious that there will be a worst-case input set.
PHP uses the parlance "associative array" most often, to be fair. The confusion is really that PHP has few types covering a lot of features: it combines its numerically indexed array type with its ordered maps, so if you start out using numbers as keys, no conversion happens when you introduce a string key that makes it associative. Array functions (pop/push) will continue to work, as will dictionary key sorting, etc.
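The semantics described above can be sketched in Python, since an insertion-ordered dict behaves much like PHP's one-size-fits-all "array" (this is an analogy, not PHP's implementation; the `next_key` logic mimics how PHP picks the key for a push):

```python
# A rough Python sketch of PHP's "array" semantics: one ordered-map type
# that serves as list and dictionary at once. (Python dicts preserve
# insertion order, which makes the analogy work.)
a = {0: "zero", 1: "one"}   # starts out looking like a plain array
a["foo"] = "bar"            # adding a string key needs no conversion

# A "push" is just appending with the next integer key, as PHP does:
next_key = max((k for k in a if isinstance(k, int)), default=-1) + 1
a[next_key] = "two"

print(list(a.keys()))  # [0, 1, 'foo', 2] — insertion order is kept
```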
I would like to know if PHP's object implementation, which was introduced after their associative arrays, suffers from the same hash collision issues. I've read CPython's hash algorithm (it's quite beautiful) and I wonder if PHP internally represents objects as glorified hash maps also.
> I would like to know if PHP's object implementation, which was introduced after their associative arrays, suffers from the same hash collision issues.
I was curious myself, so I went digging. Most of this stuff appears in `Zend/zend_object_handlers.c`. For objects with std_object_handlers set up on them (which I assume is most of them), zend_std_read_property can do quite a bit of work: it tries a search of the property-info hash in zobj->ce, then the zobj->properties hash, then calls `__get()` (with protection against `__get()` loops), and finally gives up and returns null. zend_hash_quick_find(), as used for these lookups, takes a HashTable* as its first argument, which is also the type used for a hash table in zvalue_value.
Thus, object properties are associative arrays, though I don't think there's an equivalent severity of attack compared to causing collisions with crafted GET/POST data, as the latter is automatically parsed. You'd have to be able to upload code to make evil objects.
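The lookup chain described above can be sketched in Python (a loose analogy only; the class and method names here are illustrative, not PHP internals):

```python
# A loose Python sketch of the lookup order described for
# zend_std_read_property: instance attributes first, then the dynamic
# properties hash, then a __get()-style magic fallback, then "null".
class PhpishObject:
    def __init__(self):
        self.properties = {}          # plays the role of zobj->properties

    def __get(self, name):            # stand-in for PHP's magic __get()
        return None                   # give up: PHP would return null

    def read_property(self, name):
        if name in vars(self) and name != "properties":
            return vars(self)[name]   # found as a regular attribute
        if name in self.properties:
            return self.properties[name]
        return self.__get(name)       # magic fallback (loop guard omitted)

obj = PhpishObject()
obj.properties["x"] = 42
print(obj.read_property("x"))   # 42 — found in the properties hash
print(obj.read_property("y"))   # None — the __get() fallback gives up
```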
Yes, that is just one way in which PHP is annoying. I think they actually refer to it as an "associative array", which as far as I can tell is exactly a hash, but they never refer to it as one.
On a side note, Ruby does not appear to have the same problem, as similar code executes in 0.02 seconds. I think it is probably because Ruby's hash function is superior.
Having just done this research... Ruby 1.9 has apparently had randomized hash functions for years. Ruby 1.8.7 and JRuby were apparently vulnerable until yesterday, but both released fixed versions yesterday.
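The idea behind that randomization can be sketched in a few lines of Python (a toy illustration only; Ruby 1.9 actually seeds MurmurHash, and the constant and mixing below are not the real algorithm):

```python
import random

# Toy sketch of per-process hash randomization. A secret seed is picked
# once at interpreter start-up and mixed into every hash, so an attacker
# who doesn't know the seed can no longer precompute colliding keys.
SEED = random.getrandbits(32)

def seeded_hash(key: int, table_size: int) -> int:
    # Mix the seed in, multiply by a large odd constant, and take the
    # HIGH bits, which depend on every input bit.
    h = ((key ^ SEED) * 2654435761) % 2**32
    return (h >> 16) % table_size

# Under plain "key % size", these keys all collide in bucket 0:
keys = range(0, 64 * 64, 64)
print({k % 64 for k in keys})                    # {0}
# Under the seeded hash they scatter across many buckets:
print(len({seeded_hash(k, 64) for k in keys}))
```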
Meanwhile PHP will presumably do a similar set of fixes, though I haven't seen the announcement yet. There are workarounds for web apps that involve using extensions to limit the size of POST requests and/or the acceptable number of parameters in a POST request.
_In the case of the Ruby language, the 1.9.x branch is not affected by the predictable collision condition since this version includes a randomization of the hashing function._
So there is some merit to what the commenter is saying, though I doubt he knew the above.
Actual ruby arrays (which are arrays and not hashes) will obviously not exhibit this problem though.
I think for all practical purposes, unless you're doing something really weird, the likelihood of hash function collisions is rare enough that we don't need to think too much about it.
> I think for all practical purposes, unless you're doing something really weird, the likelihood of hash function collisions is rare enough that we don't need to think too much about it.
Except that, like with PHP, the worrying part is that someone can stuff malicious keys into rack.request.form_hash or rack.request.query_hash (a la PHP's $_POST and $_GET).
(Unlike PHP, though, the Ruby community can head off these particular attacks by releasing a new version of Rack, while waiting for a new 1.8.x release containing a security patch.)
> PHP already landed a change (which will ship with PHP 5.3.9) which will add a max_input_vars ini setting which defaults to 1000. This setting determines the maximum number of POST/GET variables that are accepted, so now only a maximum of 1000 collisions can be created.
Wait, where did we establish that less user input = less array insertions?
Well, the point of that change was to prevent a hostile user from DoSing your server with malicious GET/POST requests.
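For reference, the mitigation quoted above is an ordinary php.ini directive (shipping with PHP 5.3.9):

```ini
; php.ini — cap the number of accepted input variables per request.
; Variables beyond the limit are dropped, which bounds the number of
; collisions a single malicious request can force.
max_input_vars = 1000
```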
I imagine it would actually be fairly difficult to accidentally recreate this issue, or to let it slip through testing. No amount of patches in the world will protect you from idiots with access to your codebase.
It is somewhat risky to fundamentally change the hashing algorithm late in the release cycle (RC4); it is bound to cause problems. The ini option prevents the obvious threat without deep changes to the core.
If you're running a version of PHP which is patched with Suhosin (http://www.hardened-php.net/suhosin/) you'll already be protected from the DoS vector outlined in the article. By default Suhosin limits the max number of $_GET/$_POST/$_COOKIE parameters to 200.
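For anyone looking for the knobs, these are the relevant Suhosin directives (the values below are illustrative, matching the limit mentioned above; check your suhosin.ini for the shipped defaults):

```ini
; Suhosin request-variable limits, one per input source.
suhosin.get.max_vars     = 200
suhosin.post.max_vars    = 200
suhosin.request.max_vars = 200
```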
> So if we insert a total of 64 elements, the first one 0, the second one 64, the third one 128, the fourth one 192, etc., all of those elements will have the same hash (namely 0) and all will be put into the same linked list.
This doesn't seem correct, or I'm missing something.
Upon the first insertion, PHP doesn't know that you intend to insert 63 more elements. It shouldn't allocate a 2^6-element underlying array until it exceeds 2^5 elements, right? So the first 2^5 insertions would be constant time, and only the next 2^5 would be linear.
I'm not sure how PHP performs the reallocation to increase an array's capacity. Maybe it allocates a blank array, and then inserts the existing elements using the standard insertion algorithm. In that case 2^6 linear-time insertions would occur -- half during the reallocation, and half afterward. But it still bears mentioning that performance wouldn't tank until you inserted half+1 of the values.
It's not really an array, it's a hash table. When you insert an item it doesn't simply put it in a previously allocated array, but places it at the end of a linked list in the correct bucket determined by the hash. In order to check if the key has already been used the list must be traversed each time, and so insertion becomes a linear operation.
Even if the HT hasn't reached its full size yet, all elements will nonetheless collide: if the size is currently 16, then 64 will still have a hash of 0, as will 128, etc. At size 16 other values would additionally collide, like 16, 32, 48, etc., but those would stop being collisions as soon as the HT is resized, so we don't insert them.
Even if the real C array contained a pointer to the last element in the LL (which it does not, it only points to the first) it would still be worst-case O(n): PHP has to check all elements in the LL to ensure that the element does not exist yet.
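The mechanism described in this subthread is easy to instrument in Python (a sketch of the behavior, not PHP's actual zend_hash code): index by `key & (size - 1)`, chain collisions in a bucket, and scan the whole chain on every insert to check whether the key already exists.

```python
# A small model of the insertion behavior described above, counting the
# key comparisons done by the duplicate check on each insert.
def insert_all(keys, table_size=64):
    buckets = [[] for _ in range(table_size)]
    comparisons = 0
    for key in keys:
        chain = buckets[key & (table_size - 1)]
        for existing, _ in chain:        # duplicate check: O(chain length)
            comparisons += 1
            if existing == key:
                break
        else:
            chain.append((key, None))    # key not found: insert it
    return comparisons

n = 64
evil = [i * n for i in range(n)]         # 0, 64, 128, ... all hash to 0
nice = list(range(n))                    # 0, 1, 2, ... spread evenly

print(insert_all(evil))  # n*(n-1)/2 = 2016 comparisons: quadratic
print(insert_all(nice))  # 0 comparisons: every chain has length <= 1
```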
Of course! I hadn't considered that inserting an element can mean updating an already existing element.
In C you don't need a pointer to the last element though, you can just replace the pointer to the first element with the new element and put the old pointer in the "next" field of the new element. I usually implement stacks this way.
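That head-insertion trick looks like this, sketched here in Python with a minimal cons cell rather than C pointers:

```python
# Pushing onto a singly linked list needs only the head pointer: the
# new node's "next" takes over the old head.
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

head = None
for v in (1, 2, 3):
    head = Node(v, head)   # O(1) push at the head

# Walking the list yields items in reverse insertion order, which is
# why this structure doubles as a stack:
out = []
node = head
while node:
    out.append(node.value)
    node = node.next
print(out)  # [3, 2, 1]
```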
> Even if the real C array contained a pointer to the last element in the LL (which it does not, it only points to the first)
What version of PHP are you looking at? PHP 5.3 and up seem to have a sensible linked list implementation in zend_hash.[ch]. The buckets only have a pointer to the list head, but items are inserted at the head. The hash has a separate list for ordered traversal of all items in the "array" and that has pointers to both the head (for traversal) and tail (for insertion). In both cases, list insertion is O(1).
Of course, as you said, the search for our worst case is necessarily O(N). Depending on your perspective, we can say that's a trade-off with hashes, a fault with the hash algorithm, or a fault with the collision resolution strategy.
I just ran this (using the PHP CLI) on my Lion MacBook Air and it took far, far longer than 30 seconds - 172 seconds, to be exact.
Not being at all familiar with the underlying mechanisms, does anybody understand why it would take so much longer on {PHP 5.3.6, Lion, my MacBook Air, the PHP CLI}?
I know it's hard to believe, but the low-end MBA has a Core i5 processor. https://www.apple.com/macbookair/features.html Apple worked directly with Intel to get a smaller "package", and soldered it to the mainboard to make it fit.
That's interesting, I've got a Core 2 Duo which I assumed would be similar to the i3 in terms of power. I guess it shows that processor versioning is more than just marketing.
Running `php -a` will give you an interactive shell as well. However, if you do a lot of PHP development, phpsh (http://www.phpsh.org/) is a lot more feature-rich and can be a huge timesaver.
Haven't tested it yet, but I believe that if you use the Suhosin patch you can limit the number of variables with suhosin.get.max_vars and suhosin.post.max_vars, so you do not have to wait for PHP to release the patched version (I guess most people do not compile PHP from Subversion trunk for production use).
Yes. This is a trivial exploit of a designed weakness in hashtables.
I shall, arrogant as I am, scoff at people who are surprised when this happens. I mean, honestly, what kind of developer doesn't know the basic properties of their data structures?
You make it sound like you're somehow disagreeing with him, but what he says is true even of Ruby's hash algorithm. Introducing randomness into the hash function is really just a band-aid on this vulnerability. The inherent vulnerability is there either way; you just need a bit of runtime information to do the attack when runtime information is introduced into the hash function.
Yes, those are no longer values that will hit the worst-case performance here.
Did you read the article? The reason for the bad performance is that PHP hashes an integer by simply taking it mod the size of the hash table, so values which are 0 mod the table size all go into the same bucket. If the table size is a power of 2, then every multiple of that size -- in particular any power of two at least as large as the table, or any sum of such powers -- hashes to 0 and lands in the same bucket. So if you insert a whole bunch of values that increase by a large power of two, they will all hash to 0 and give you O(n^2) performance.
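The arithmetic in a few lines of Python (table size 64 is just an example; PHP implements the mod as a bitmask since the size is a power of two):

```python
# With a table of size 64, PHP-style integer hashing is just
# "key mod 64", so every multiple of 64 falls into bucket 0.
table_size = 64
keys = [i * table_size for i in range(8)]   # 0, 64, 128, ..., 448
print([k % table_size for k in keys])       # [0, 0, 0, 0, 0, 0, 0, 0]
print([k & (table_size - 1) for k in keys]) # same result via the bitmask
```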
Note that the size referred to there is being used not only as the size of the array, but also as the amount to iterate by. By changing it to something other than a power of 2, you are no longer inserting "evil" elements.
The point of this article is that it's really, really easy to do a denial of service attack on PHP arrays by picking array indices that are large powers of two. Of course you can fix the problem by not using large powers of two.
Very nice explanation. Thanks to you and a few others I now understand what is happening far better :). Interestingly, even at 256 it takes 10 times as long to insert the evil elements.
Thanks again for helping me understand and learn something new today :)