[devel] quick survey: Bernstein hash vs Jenkins hash
Alexey Tourbin
at на altlinux.ru
Вс Авг 15 20:05:29 UTC 2010
Summary: to implement soname set-versions, we need a hash function for
ELF symbols. Despite its theoretical weakness, djb hash seems to perform
slightly better than Bob Jenkins' "one-at-a-time" hash.
In soname set-versions, each ELF symbol will be represented with its hash
value, using typically 20 bits per symbol. Hence we need a hash function
which does a good job. Badly. However, since 20 bits is not much more
than a completely ridiculous number of bits, basically we don't need
a function which is cryptographically strong and which can guard against
attacks. We only need a function which plays very well with real-world
data, in terms of collisions.
If you're unfamiliar with the subject, here's Bob Jenkins' survey:
http://burtleburtle.net/bob/hash/doobs.html
There's quite a few hash functions which might serve our purpose.
To name a few: Jenkins' one-at-a-time hash is used in perl; libstdc++
makes use of FNV hashing; in GNU_HASH, they resort to age-old djb hash.
Below I attempt to compare djb hash and Jenkins' hash against each other.
There's a subtle point, though: djb does not "mix" bits; to obtain
a smaller hash value, one is better off with modulo prime. Jenkins does
mix bits, and smaller hash values can be obtained with bitmask. Now note
that, to match hash values of different size, we have to use bitmask -
modulo prime is simply not an option. However, it seems that djb hash
performs quite well even with bitmask.
So, I have a program hash-strings.c, and I have some packages under
/var/cache/apt/archives. Let's see how 'j' and 'b' 20-bits hashes
differ in terms of collisions.
$ gcc -Wall -O hash-strings.c -lcrypto -o hash-strings
Prepare some data:
$ rpmelfsym.pl /var/cache/apt/archives |cut -f4 |./hash-strings b 20 |sort -u >b20
$ rpmelfsym.pl /var/cache/apt/archives |cut -f4 |./hash-strings j 20 |sort -u >j20
$ wc -l b20 j20
205364 b20
205364 j20
410728 total
$
If you can't get it, here is what it is:
$ head b20
ACCESS_DESCRIPTION_free 00096a3b
ACCESS_DESCRIPTION_it 0009f4f6
ACCESS_DESCRIPTION_new 0008a783
ACL_1.0 0006e163
ACL_1.1 0006e164
AESKeyWrap_Decrypt 0009acbb
AESKeyWrap_DestroyContext 000b11af
AESKeyWrap_Encrypt 00025725
AES_Decrypt 0004c798
AES_DestroyContext 0009b1ec
$
Unique hash values:
$ sort -u -k2 b20 |wc -l
186769
$ sort -u -k2 j20 |wc -l
186490
$
Total number of collisions:
$ sort -k2 b20 |uniq -D -f1 |wc -l
36057
$ sort -k2 j20 |uniq -D -f1 |wc -l
36567
$
Number of collision clusters:
$ sort -k2 b20 |uniq -c -f1 |awk '$1>1' |wc -l
17462
$ sort -k2 j20 |uniq -c -f1 |awk '$1>1' |wc -l
17693
$
Number of collisions in a cluster:
$ sort -k2 b20 |uniq -c -f1 |awk '$1>1' |sort -n |tail
4 _ZNSs2atEm 000b350b
4 _ZTV13wxRendererGTK 000ddc38
4 _ZTVN9functions11rootsectionE 0005a028
4 ___G__23__23_make_2d_chartable 00004386
4 ___G_c_23_set_2d_val 000afb31
4 _lsa_RemoveAccountRights 00012b0f
4 asn1buf_remains 0002e023
4 cli_rpc_pipe_open_schannel_with_key 00077d4d
4 do_add 00014580
4 pango_font_family_get_name 0008d850
$ sort -k2 j20 |uniq -c -f1 |awk '$1>1' |sort -n |tail
4 _ZNK6OpenSP19ExternalEntityEvent8locationEv 0003b594
4 _ZTI8ppdcFont 000b054d
4 _ZTS13wxDocTemplate 00071287
4 _ZTV16CLimitedInStream 000c02d7
4 ___S_c_23_define_2d_parameterized_2d_decl 000bae87
4 _prelude_async_fork_prepare 0008201f
4 add_setshow_zuinteger_cmd 00012317
4 close_files 0000ceda
5 CONVFMTidx 000aa087
5 _ZN30subselect_single_select_engine7excludeEv 0002810c
$
So, readily admitting me being mundane, djb hash is simply better.
----------- следующая часть -----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
static
unsigned int bhash(const char *str)
{
unsigned int hash = 5381;
const unsigned char *p = (const unsigned char *) str;
int c;
while ((c = *p++))
hash = (hash << 5) + hash + c;
return hash;
}
static
unsigned int jhash(const char *str)
{
unsigned int hash = 0x9e3779b9;
const unsigned char *p = (const unsigned char *) str;
while (*p) {
hash += *p++;
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
#include <openssl/md5.h>
static
unsigned int mhash(const char *str)
{
unsigned char md5[40];
MD5((const unsigned char *)str, strlen(str), md5);
return md5[0] | (md5[1] << 8) | (md5[2] << 16) | (md5[3] << 24);
}
int main(int argc, char *argv[])
{
assert(argc == 3);
unsigned int (*hashfunc)(const char *str);
switch (*argv[1]) {
case 'b':
hashfunc = bhash;
break;
case 'j':
hashfunc = jhash;
break;
case 'm':
hashfunc = mhash;
break;
default:
assert("hashfunc" == NULL);
break;
}
int bpp = atoi(argv[2]);
assert(bpp > 0 && bpp <= 32);
char *line = NULL;
size_t alloc_size = 0;
ssize_t len;
while ((len = getline(&line, &alloc_size, stdin)) >= 0) {
if (len > 0 && line[len-1] == '\n')
line[--len] = '\0';
if (len == 0)
continue;
unsigned int hash = hashfunc(line);
if (bpp < 32)
hash &= (1 << bpp) - 1;
printf("%s\t%08x\n", line, hash);
}
return 0;
}
Подробная информация о списке рассылки Devel