[devel] quick survey: Bernstein hash vs Jenkins hash

Вс Авг 15 20:05:29 UTC 2010

Summary: to implement soname set-versions, we need a hash function for
ELF symbols.  Despite its theoretical weakness, djb hash seems to perform
slightly better than Bob Jenkins' "one-at-a-time" hash.

In soname set-versions, each ELF symbol will be represented with its hash
value, using typically 20 bits per symbol.  Hence we need a hash function
which does a good job.  Badly.  However, since 20 bits is not much more
than a completely ridiculous number of bits, basically we don't need
a function which is cryptographically strong and which can guard against
attacks.  We only need a function which plays very well with real-world
data, in terms of collisions.

If you're unfamiliar with the subject, here's Bob Jenkins' survey:
http://burtleburtle.net/bob/hash/doobs.html

There's quite a few hash functions which might serve our purpose.
To name a few: Jenkins' one-at-a-time hash is used in perl; libstdc++
makes use of FNV hashing; in GNU_HASH, they resort to age-old djb hash.

Below I attempt to compare djb hash and Jenkins' hash against each other.
There's a subtle point, though: djb does not "mix" bits; to obtain
a smaller hash value, one is better off with modulo prime.  Jenkins does
mix bits, and smaller hash values can be obtained with bitmask.  Now note
that, to match hash values of different size, we have to use bitmask -
modulo prime is simply not an option.  However, it seems that djb hash
performs quite well even with bitmask.

So, I have a program hash-strings.c, and I have some packages under
/var/cache/apt/archives.  Let's see how 'j' and 'b' 20-bits hashes
differ in terms of collisions.

$ gcc -Wall -O hash-strings.c -lcrypto -o hash-strings

Prepare some data:

$ rpmelfsym.pl /var/cache/apt/archives |cut -f4 |./hash-strings b 20 |sort -u >b20
$ rpmelfsym.pl /var/cache/apt/archives |cut -f4 |./hash-strings j 20 |sort -u >j20 
$ wc -l b20 j20
  205364 b20
  205364 j20
  410728 total
$

If you can't get it, here is what it is:

$ head b20
ACCESS_DESCRIPTION_free 00096a3b
ACCESS_DESCRIPTION_it   0009f4f6
ACCESS_DESCRIPTION_new  0008a783
ACL_1.0 0006e163
ACL_1.1 0006e164
AESKeyWrap_Decrypt      0009acbb
AESKeyWrap_DestroyContext       000b11af
AESKeyWrap_Encrypt      00025725
AES_Decrypt     0004c798
AES_DestroyContext      0009b1ec
$ 

Unique hash values:

$ sort -u -k2 b20 |wc -l
186769
$ sort -u -k2 j20 |wc -l
186490
$ 

Total number of collisions:

$ sort -k2 b20 |uniq -D -f1 |wc -l                                                
36057
$ sort -k2 j20 |uniq -D -f1 |wc -l
36567
$ 

Number of collision clusters:

$ sort -k2 b20 |uniq -c -f1 |awk '$1>1' |wc -l
17462
$ sort -k2 j20 |uniq -c -f1 |awk '$1>1' |wc -l
17693
$

Number of collisions in a cluster:

$ sort -k2 b20 |uniq -c -f1 |awk '$1>1' |sort -n |tail
      4 _ZNSs2atEm      000b350b
      4 _ZTV13wxRendererGTK     000ddc38
      4 _ZTVN9functions11rootsectionE   0005a028
      4 ___G__23__23_make_2d_chartable  00004386
      4 ___G_c_23_set_2d_val    000afb31
      4 _lsa_RemoveAccountRights        00012b0f
      4 asn1buf_remains 0002e023
      4 cli_rpc_pipe_open_schannel_with_key     00077d4d
      4 do_add  00014580
      4 pango_font_family_get_name      0008d850
$ sort -k2 j20 |uniq -c -f1 |awk '$1>1' |sort -n |tail
      4 _ZNK6OpenSP19ExternalEntityEvent8locationEv     0003b594
      4 _ZTI8ppdcFont   000b054d
      4 _ZTS13wxDocTemplate     00071287
      4 _ZTV16CLimitedInStream  000c02d7
      4 ___S_c_23_define_2d_parameterized_2d_decl       000bae87
      4 _prelude_async_fork_prepare     0008201f
      4 add_setshow_zuinteger_cmd       00012317
      4 close_files     0000ceda
      5 CONVFMTidx      000aa087
      5 _ZN30subselect_single_select_engine7excludeEv   0002810c
$ 

So, readily admitting me being mundane, djb hash is simply better.
----------- следующая часть -----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

static
unsigned int bhash(const char *str)
{
    unsigned int hash = 5381;
    const unsigned char *p = (const unsigned char *) str;
    int c;

    while ((c = *p++))
	hash = (hash << 5) + hash + c;

    return hash;
}

static
unsigned int jhash(const char *str)
{
    unsigned int hash = 0x9e3779b9;
    const unsigned char *p = (const unsigned char *) str;
    while (*p) {
	hash += *p++;
	hash += (hash << 10);
	hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

#include <openssl/md5.h>

static
unsigned int mhash(const char *str)
{
    unsigned char md5[40];
    MD5((const unsigned char *)str, strlen(str), md5);
    return md5[0] | (md5[1] << 8) | (md5[2] << 16) | (md5[3] << 24);
}

int main(int argc, char *argv[])
{
    assert(argc == 3);

    unsigned int (*hashfunc)(const char *str);
    switch (*argv[1]) {
    case 'b':
	hashfunc = bhash;
	break;
    case 'j':
	hashfunc = jhash;
	break;
    case 'm':
	hashfunc = mhash;
	break;
    default:
	assert("hashfunc" == NULL);
	break;
    }

    int bpp = atoi(argv[2]);
    assert(bpp > 0 && bpp <= 32);

    char *line = NULL;
    size_t alloc_size = 0;
    ssize_t len;

    while ((len = getline(&line, &alloc_size, stdin)) >= 0) {
	if (len > 0 && line[len-1] == '\n')
	    line[--len] = '\0';
	if (len == 0)
	    continue;
	unsigned int hash = hashfunc(line);
	if (bpp < 32)
	    hash &= (1 << bpp) - 1;
	printf("%s\t%08x\n", line, hash);
    }
    return 0;
}