Bug 150137 - IdentifierRepHash has a mismatch between hash() and equal()
Summary: IdentifierRepHash has a mismatch between hash() and equal()
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: All
OS: All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-14 13:16 PDT by Filip Pizlo
Modified: 2015-10-16 01:56 PDT

See Also:


Attachments

Description Filip Pizlo 2015-10-14 13:16:16 PDT
IdentifierRepHash inherits equal() from PtrHash, so it defines equality as pointer equality.  But it implements hash() by calling the StringImpl's hash method.

This seems either incorrect or inefficient.

The hash() method will return a hash value based on the string's value, not the string's identity.  So, non-identical strings might end up having the same value, and so the same hash.  Also, that hash takes more effort to compute than the normal PtrHash.

Quite likely, the PtrHash::hash method would be the most efficient, since it doesn't require loading anything through the pointer.  Loads are usually the more expensive operations.

Also, this means that we cannot have a HashMap with IdentifierRepHash that doesn't ref its strings and outlives some of the strings that are in it.  This seems dirty, but it might actually be profitable in some cases, like InferredTypeTable.  It's OK if that table has an entry for a dangling StringImpl*, because all this means is that if some StringImpl gets allocated at the same memory address, we'll already have a hypothesis about the type.  It might be a wrong hypothesis, but that's still sound - at worst we'll come up with a looser type than what we could have done.  This might be a profitable optimization if the reduction in ref/deref traffic produced a big enough speed-up.

Anyway, it seems like IdentifierRepHash is doing either too much work in hash() or too little work in equal(), and we should fix it to be consistent, probably by just replacing it with PtrHash.
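To make the mismatch concrete, here is a minimal standalone sketch (these are hypothetical stand-ins, not the real WTF classes): the hash is computed from the string's contents while equality is pointer identity, so two distinct strings with equal contents land in the same bucket region yet never compare equal.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the WTF hashers; not the actual WebKit code.

// Value-based hash: depends only on the characters (like StringImpl::hash()).
unsigned valueHash(const std::string& s)
{
    unsigned h = 5381;
    for (char c : s)
        h = h * 33 + static_cast<unsigned char>(c);
    return h;
}

// The mismatch described above: hash by contents, compare by identity
// (the equal() that PtrHash would provide).
struct MismatchedHash {
    static unsigned hash(const std::string* p) { return valueHash(*p); }
    static bool equal(const std::string* a, const std::string* b) { return a == b; }
};
```

With this pairing, two equal-content strings at different addresses hash identically but fail equal(), so a lookup probes past them without ever matching.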
Comment 1 Yusuke Suzuki 2015-10-15 04:54:20 PDT
I don't remember the exact bug number (maybe it was in the context of Symbol and IdentifierRepHash), but IIRC, when I attempted to change this to PtrHash, I heard the following reason why we did not use PtrHash: we intentionally use StringHasher instead of PtrHash because it gives a nicer distribution for strings and reduces collisions in the hash table.
So when changing this, I think measuring collisions and performance would be good.
Comment 2 Yusuke Suzuki 2015-10-15 05:01:16 PDT
(In reply to comment #1)
> I don't remember the exact bug number (maybe it was in the context of Symbol
> and IdentifierRepHash), but IIRC, when I attempted to change this to PtrHash,
> I heard the following reason why we did not use PtrHash: we intentionally use
> StringHasher instead of PtrHash because it gives a nicer distribution for
> strings and reduces collisions in the hash table.
> So when changing this, I think measuring collisions and performance would be
> good.

Personally, I'd guess PtrHash gives a nice distribution.  PtrHash maps intptr_t => unsigned.  StringHasher maps UChar[arbitrary length] => unsigned.  If PtrHash works well, I think it could provide a nice distribution in this case too.
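For illustration, a pointer hash in this spirit can be sketched as a bit-mixing function over the pointer's bits (this is a generic Thomas Wang-style mixer, not necessarily what WTF::PtrHash actually computes). Note that it never dereferences the pointer, so no load from memory is needed:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a pointer hash: mix the pointer's bits into an unsigned.
// Generic Wang-style mixing; the real WTF::PtrHash may differ.
unsigned pointerHash(const void* p)
{
    uintptr_t key = reinterpret_cast<uintptr_t>(p);
    key += ~(key << 15);
    key ^= (key >> 10);
    key += (key << 3);
    key ^= (key >> 6);
    key += ~(key << 11);
    key ^= (key >> 16);
    return static_cast<unsigned>(key);
}
```

Each mixing step is invertible, so distinct pointers stay distinct until the final truncation to unsigned; the cost is a handful of ALU operations with no memory access.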
Comment 3 Filip Pizlo 2015-10-15 11:27:41 PDT
(In reply to comment #2)
> Personally, I'd guess PtrHash gives a nice distribution.  PtrHash maps
> intptr_t => unsigned.  StringHasher maps UChar[arbitrary length] => unsigned.
> If PtrHash works well, I think it could provide a nice distribution in this
> case too.

Do you recommend measuring distribution directly, or do you think it's OK to just see how performance is affected overall?

We may encounter a trade-off between a fast-to-compute hash function that gives worse distribution and a slow-to-compute hash function that gives better distribution.  I don't have a philosophical preference between the two, but I suspect that unless the distribution is really bad, the faster-to-compute hash function will give better overall performance.  Maybe the best thing to do is just to look at end-to-end performance.  Probably, it doesn't matter at all.  In that case, PtrHash is nicer because it's the less surprising choice.
Comment 4 Darin Adler 2015-10-15 13:26:17 PDT
As I recall, we were originally using PtrHash for this purpose (or a similar one in WebCore?) and Gavin changed to use the string hash instead to get a measurable performance boost. But I don’t think the change was to recompute a string hash each time; our strings memoize the result of the string hash function.
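The memoization described here can be sketched as follows (the class and field names are made up for illustration; WTF::StringImpl's actual layout and hashing differ): the string hash is computed once on first request and cached, so subsequent hash-table lookups pay only a load, not a recomputation.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of hash memoization; not the real StringImpl.
struct CachedString {
    std::string chars;
    mutable unsigned cachedHash = 0; // 0 acts as the "not yet computed" sentinel

    unsigned hash() const
    {
        if (!cachedHash) {
            unsigned h = 5381;
            for (char c : chars)
                h = h * 33 + static_cast<unsigned char>(c);
            cachedHash = h ? h : 1; // never cache 0, since 0 means "unset"
        }
        return cachedHash;
    }
};
```

With caching in place, the per-lookup cost of a string-value hash drops to roughly one load, which narrows the gap with a pure pointer hash.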
Comment 5 Yusuke Suzuki 2015-10-16 01:56:39 PDT
(In reply to comment #3)
> Do you recommend measuring distribution directly, or do you think it's OK to
> just see how performance is affected overall?

I was thinking about HashTable::dumpStats.
DUMP_HASHTABLE_STATS can dump the statistics, including collisions.
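As a rough illustration of the kind of statistic such a dump reports, one can count first-probe collisions for a batch of hash values against a given table size (this is a toy counter, not the actual DUMP_HASHTABLE_STATS machinery):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy collision counter: how many keys land in a bucket that is
// already occupied on their first probe. Illustrative only.
unsigned countFirstProbeCollisions(const std::vector<unsigned>& hashes, std::size_t tableSize)
{
    std::vector<bool> used(tableSize, false);
    unsigned collisions = 0;
    for (unsigned h : hashes) {
        std::size_t bucket = h % tableSize;
        if (used[bucket])
            ++collisions;
        else
            used[bucket] = true;
    }
    return collisions;
}
```

Running this over the hashes produced by PtrHash versus StringHasher on the same key set would give a direct, if crude, comparison of their distributions.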

> We may encounter a trade-off between a fast to compute hash function that
> gives worse distribution and a slow to compute hash function that gives
> better distribution.  I don't have a philosophical preference between the
> two, but I just suspect that unless the distribution is really bad, the
> faster-to-compute hash function will give better overall performance.  Maybe
> the best thing to do is just to look at end-to-end performance.  Probably,
> it doesn't matter at all.  In that case, PtrHash is nicer because it's the
> less surprising choice.

Agreed. If the change does not cause any performance regression and there is enough test coverage, I think that indirectly indicates the distribution of PtrHash is not so bad.