User Defined Functions » StringHash » Analysis

This is an analysis of bit transmutation and distribution patterns of the StringHash function, mainly in comparison to the "Java hash", a variant of Dan Bernstein's old "33" hash posted long ago in the comp.lang.c newsgroup.

Reference code

Java hash

  for i = 1, len do 
    counter = (counter*31) + string.byte(text, i)
    counter = mod(counter, 4294967296);	-- 2^32
  end; 
  return mod(counter, 4294967296);  -- 2^32

This is the function implemented in the Java String.hashCode library call.

The multiplication by 31 basically translates to "new = old<<5 - old", i.e. you never completely lose data from earlier bytes as you would by only shifting data up. The relation to Bernstein's original "33" function is pretty clear; it simply uses 33 instead of 31.

The mod() is necessary in Lua, otherwise the value turns floating-point and you lose precision after ~10 characters; Lua, at least under x86, uses 64-bit ("double-precision") IEEE 764 floats, which have a mantissa of 52 bits.

This function would be slightly better with a prime modulus, but 2^32 is chosen as a modulus to emulate 32-bit integer math as used in the Java SDK.

StringHash

  for i = 1, len, 3 do 
    counter = mod(counter*8161, 4294967279) +  -- 2^32 - 17: Prime!
  	  (string.byte(text,i)*16776193) +
  	  ((string.byte(text,i+1) or (len-i+256))*8372226) +
  	  ((string.byte(text,i+2) or (len-i+256))*3932164);
  end; 
  return mod(counter, 4294967291); -- 2^32 - 5: Prime (and different from the prime in the loop)

This string hash is ~30% faster than the Java hash due to working with 3 characters at a time; the modulus call is fairly expensive compared to the other operations.

The multiplications are all chosen to affect 3 areas of the accumulator at a time:

     8161 =     8192 -     32 + 1  : 2<<13,  2<< 5,  2<<0
 16776193 = 16777216 -   1024 + 1  : 2<<24,  2<<10,  2<<0
  8372226 =  8388608 -  16384 + 2  : 2<<23,  2<<14,  2<<5
  3932164 =  4194304 - 262144 + 4  : 2<<22,  2<<18,  2<<10

It is tempting to make the first multiplication much larger, but beware that the result of (4294967279+(256*29080583))*8161 must fit in a 52-bit mantissa without losing precision. The result is currently a 48-bit number so meets that contraint.

Tests performed and result analysis

Below is an array of tests that both algorithms have been submitted to. For a hash analysis, these tests are fairly basic and the result analysis is somewhat unscientific, but they will do to illustrate the difference between the Java hash and StringHash, which is all this document set out to do.

Performance test - raw measurement of relative hash speed
Bit distribution test - how often is each bit of the 32-bit output used?
Collision frequency test - compare observed collision statistics to expected numbers
Bit transmutation test - determine how many bits are changed as a result of 1 input bit changing
Collision finding test - find and count collisions for exhaustive input sets

Performance test

This test simply hashes the string "abcdefghijklmnopqrstuvwxyz" in a loop 20 000 times and clocks the start and stop time. The loop is included in the resulting time.

Java hash execution time: 1.719s
StringHash execution time: 1.313s

Result analysis

StringHash is slightly faster than the Java hash; likely not enough to matter, but certainly enough that the perceived "increased size" of StringHash can be discounted as an argument against it - it's actually faster!

The reason is of course that the math.mod() call is relatively expensive compared to extracting characters from a string and performing multiplications. StringHash does 3 extract + multiply per mod() call. The Java hash in its standard implementation has to do one mod() call per character.

On the testing computer, a P4 at ~3Ghz, StringHash hashes in the neighborhood of half a megabyte per second; certainly not bad for a scripted language using only standard math library operations!

Bit distribution test

This test measures how many times each bit is 1 for a given set of inputs. The target value is for each bit to be 1 50% of the time, and 0 50% of the time, though there will obviously be deviations from this, which are expected and accepted, up to a certain point.

In this test, we feed each hash the same set of inputs: 10000 random strings consisting of "aaaaaaaaa" followed by 15 random bytes.

The results show the absolute difference between 50% and the result, i.e. 49.4 and 50.6 will both be reported as "0.6". Only differences of 0.5% or greater are shown (0.5% is in no way a ceiling of the expected difference; it just cuts down on the amount of output).

Inputs 0-255

Java

StringHash

  2: 0.8%
  4: 0.6%
  9: 0.6%
 11: 0.5%
 13: 0.5%
 14: 0.5%
 15: 1.0%
 17: 1.1%
 20: 0.5%
 21: 0.9%
 23: 0.8%
 25: 0.9%
 28: 0.6%
 31: 0.7%
 32: 1.2%

  2: 0.5%
  3: 0.5%
  7: 0.7%
 12: 0.6%
 15: 0.9%
 16: 1.0%
 17: 0.9%
 20: 0.7%
 21: 1.1%
 25: 0.6%
 29: 0.8%
 30: 0.6%
 31: 0.7%

Inputs 32-127

Java	StringHash
2: 0.9% 7: 0.7% 8: 0.6% 9: 0.7% 10: 0.6% 11: 0.6% 12: 1.5% 15: 0.8% 16: 0.5% 18: 0.7% 22: 0.8% 27: 0.5% 29: 0.5% 30: 0.6% 32: 0.5%	1: 1.0% 3: 0.5% 6: 0.8% 14: 0.6% 19: 0.5% 23: 0.5% 25: 0.7% 28: 0.5%

Result analysis

Both algorithms perform roughly as expected. With these samples, StringHash performs a little bit better, but that is to be expected as it mixes more bits; either way, the differences are not really meaningful. The test only attempts to evaluate if all possible result bits are used efficiently without unexpected "holes" in the output.

A more thorough test would attempt to find biases in different sets of bits, but that is really more meaningful for PRNG testing and secure hashes than it is for this type of light string hash.

Collision frequency test

This test takes the result of 20000 random strings consisting of ASCII characters 0-255, and selects 15 bits for testing. The resulting value becomes 0--32767, which, for anything other than a perfect hash, will result in a fair amount of collisions. We expect the frequency of collisions to match a Poisson distribution and compare the test results to it.

Long random string 32-127 bits 1-15

     Java   SH     Expected
  1: 10940  10981  10863
  2:  6552   6554   6630
  3:  2043   1998   2023
  4:   388    396    411
  5:    65     65     62
  6:    12      6      7

Long random string 0-255 bits 1-15

     Java   SH     Expected
  1: 10717  10728  10863
  2:  6796   6724   6630
  3:  1995   2073   2023
  4:   376    420    411
  5:   110     55     62
  6:     6             7

Long random string 0-255 bits 6-20

     Java   SH     Expected
  1: 10788  10875  10863
  2:  6768   6628   6630
  3:  1989   1971   2023
  4:   400    460    411
  5:    55     60     62
  6:            6      7

Long random string 0-255 bits 12-26

     Java   SH     Expected
  1: 10917  10791  10863
  2:  6630   6612   6630
  3:  1881   2049   2023
  4:   492    452    411
  5:    80     90     62
  6:            6      7

Long random string 0-255 bits 18-32

     Java   SH     Expected
  1: 10828  10795  10863
  2:  6590   6860   6630
  3:  2127   1932   2023
  4:   380    360    411
  5:    75     35     62
  6:           18      7

Result analysis

Both hashes perform roughly as expected without showing too great variations over different bits; there are one or two aberrations in the higher collision counts, but that may just be the result of a too-small sample size. I will not investigate further as this test is only a rough indication of performance.

A more complete analysis would include testing smaller modulii, and certainly ones that are not a power of 2. A modulus constructed around series involving 31 and 32 will most likely produce "interesting" results for the Java hash, but that is an expected side effect of its construction.

Bit transmutation test

This test measures how many bits of the hash changes as a result of a single input bit changing, i.e. how close to an avalanche effect we're getting. The result is described as the number of times a given number of bits have been seen to change for the entire run.

The input starts out as a string of 0s (ASCII NULs), flipping bits on and off one at a time. Then a number of runs are executed starting out with random values.

All the below tests work with 13-byte strings, and execute for 50 runs, i.e. one run based on ASCII NULs, and 49 runs with fresh random numbers each time.

A good generic hash algorithm will tend toward flipping 50% of the bits for each input bit changed, though that is obviously unattainable. We do however expect a fair hump around 13-17 bits changed, and very low counts toward the extremes. 0 bits flipping may indicate a broken hash algorithm (the hash result didn't change at all as a result of changing the input).

Java, flip all bits in bytes 1 through 13

  0:     0     1:   173     2:   208     3:   194
  4:   188     5:   159     6:   160     7:   150
  8:   200     9:   231    10:   228    11:   292
 12:   343    13:   476    14:   503    15:   469
 16:   418    17:   318    18:   210    19:   143
 20:    69    21:    45    22:    11    23:     9
 24:     2    25:     1    26:     0    27:     0
 28:     0    29:     0    30:     0    31:     0
 32:     0

This test does not show 0 bits changing, but the changes we do see are spread quite flat over the spectrum. There is a hump around 13-17 but not as marked as it should be. We get quite a few single bit changes, which is quite expected given the algorithm. The last few bytes in the string do not end up affecting very much of the end result.

StringHash, flip all bits in bytes 1 through 13

  0:     0     1:     0     2:     0     3:    45
  4:    70     5:    77     6:    71     7:    56
  8:    45     9:    53    10:   115    11:   203
 12:   327    13:   405    14:   533    15:   588
 16:   620    17:   593    18:   499    19:   371
 20:   259    21:   146    22:    76    23:    32
 24:    12    25:     3    26:     1    27:     0
 28:     0    29:     0    30:     0    31:     0
 32:     0

StringHash has a much more pronounced hump around the 16 bits changed mark. There are no single bit changes, and no two-bit changes in the tests. This is expected as each bit changed is guaranteed to affect at least three bits of the hash, even for the last few bytes of the input.

Java, flip all bits in bytes 12 through 13

  0:     0     1:   173     2:   208     3:   154
  4:   108     5:    75     6:    53     7:    17
  8:     6     9:     1    10:     3    11:     2
 12:     0     (rest 0)

Here is where the weakness of the hash shows the most: changes in the last few bytes affect relatively few bits, thereby increasing chances for collisions.

StringHash, flip all bits in bytes 12 through 13

  0:     0     1:     0     2:     0     3:    45
  4:    70     5:    77     6:    71     7:    56
  8:    36     9:    26    10:    36    11:    40
 12:    54    13:    51    14:    57    15:    49
 16:    42    17:    39    18:    23    19:    15
 20:     9    21:     3    22:     1    23:     0
 24:     0    (rest 0)

While not precisely shining compared to strong hashes, StringHash stirs quite a few more bits over a much larger spectrum.

Result analysis

StringHash outperforms the Java hash by distributing bit changes over more result bits, thus reducing the collision chance.

Collision finding test

This test exhaustively computes all possible values of a given input set and tests for collisions. The size of the input sets is unfortunately constrained by RAM and CPU, but it is believed to be a fair indication of hash strength.

ASCII codes 64-89 (upper case A-Z), 3 bytes

JavaHash
  17576 non-colliding values

StringHash
  17576 non-colliding values

ASCII codes 64-89 (upper case A-Z), 4 bytes

JavaHash
  456976 non-colliding values

StringHash
  456976 non-colliding values

ASCII codes 32-127 (all printable), 2 bytes

JavaHash
     62 non-colliding values
     62 instances of 2 inputs producing the same hash (124 strings)
   2638 instances of 3 inputs producing the same hash (7914 strings)
    279 instances of 4 inputs producing the same hash (1116 strings)

StringHash
   9216 non-colliding values

Already here, with only searching all possible combinations of 2 letters of printable ASCII, the Java hash is showing poor results.

ASCII codes 32-127 (all printable), 3 bytes

This is about as exhaustive we can go in Lua without resulting to data compression techniques for the result sets: all possible combinations of 3 letters of printable ASCII for a total of 884736 outputs.

JavaHash
     62 non-colliding values
     62 instances of 2 inputs producing the same hash (124 strings)
   1630 instances of 3 inputs producing the same hash (4890 strings)
    224 instances of 4 inputs producing the same hash (896 strings)
     62 instances of 5 inputs producing the same hash (310 strings)
   1630 instances of 6 inputs producing the same hash (9780 strings)
     62 instances of 7 inputs producing the same hash (434 strings)
    224 instances of 8 inputs producing the same hash (1792 strings)
  68606 instances of 9 inputs producing the same hash (617454 strings)
   5214 instances of 10 inputs producing the same hash (52140 strings)
   5214 instances of 11 inputs producing the same hash (57354 strings)
   9672 instances of 12 inputs producing the same hash (116064 strings)
    558 instances of 13 inputs producing the same hash (7254 strings)
    558 instances of 14 inputs producing the same hash (7812 strings)
    558 instances of 15 inputs producing the same hash (8370 strings)

StringHash
  884736 non-colliding values

ASCII codes 32-127 (all printable), 3 bytes, with prefix

The same input values as the 32-127 x 3 test, except all prefixed with the string "01234567890123456789". This is to test for loss of accuracy.

JavaHash
     62 non-colliding values
     62 instances of 2 inputs producing the same hash (124 strings)
   1630 instances of 3 inputs producing the same hash (4890 strings)
    224 instances of 4 inputs producing the same hash (896 strings)
     62 instances of 5 inputs producing the same hash (310 strings)
   1630 instances of 6 inputs producing the same hash (9780 strings)
     62 instances of 7 inputs producing the same hash (434 strings)
    224 instances of 8 inputs producing the same hash (1792 strings)
  68606 instances of 9 inputs producing the same hash (617454 strings)
   5214 instances of 10 inputs producing the same hash (52140 strings)
   5214 instances of 11 inputs producing the same hash (57354 strings)
   9672 instances of 12 inputs producing the same hash (116064 strings)
    558 instances of 13 inputs producing the same hash (7254 strings)
    558 instances of 14 inputs producing the same hash (7812 strings)
    558 instances of 15 inputs producing the same hash (8370 strings)

StringHash
  884736 non-colliding values

ASCII codes 32-127 (all printable), 3 bytes, with suffix

The same input values as the 32-127 x 3 test, except all suffixed with the string "01234567890123456789". This is to test for loss of "old" information (data from earlier on in the string) in the mixing function.

JavaHash
     62 non-colliding values
     62 instances of 2 inputs producing the same hash (124 strings)
   1630 instances of 3 inputs producing the same hash (4890 strings)
    224 instances of 4 inputs producing the same hash (896 strings)
     62 instances of 5 inputs producing the same hash (310 strings)
   1630 instances of 6 inputs producing the same hash (9780 strings)
     62 instances of 7 inputs producing the same hash (434 strings)
    224 instances of 8 inputs producing the same hash (1792 strings)
  68606 instances of 9 inputs producing the same hash (617454 strings)
   5214 instances of 10 inputs producing the same hash (52140 strings)
   5214 instances of 11 inputs producing the same hash (57354 strings)
   9672 instances of 12 inputs producing the same hash (116064 strings)
    558 instances of 13 inputs producing the same hash (7254 strings)
    558 instances of 14 inputs producing the same hash (7812 strings)
    558 instances of 15 inputs producing the same hash (8370 strings)

StringHash
  884736 non-colliding values

ASCII codes 64-82 (A-S), 5 bytes

The Lua memory churn becomes somewhat ugly here, consuming several hundred megabytes. Even attempting A-Z with the same length brings a machine with 1GB RAM to its knees.

JavaHash
  2476099 non-colliding values

StringHash
  2476099 non-colliding values

Result analysis

The Java hash is shown to produce many more collisions even for rather simple inputs. The only cases where it behaves well are the A-Z cases; as long as the input subset is constricted to 32 unique values, it is collision free for 6 bytes (6 * 5 bits = 30 bits), in other words, it is a perfect hash for e.g. short strings consisting of only upper case ASCII letters A-Z. This is an expected property of the algorithm. Though taken with the rather abysmal results for more generic strings above (or, indeed, any sort of string longer than 6 bytes), the Java hash demonstrably makes for a poor choice for general input.

Result summary

In the two simplest cases: collision frequency and bit distribution, the two hashes seem comparable. But on comparing bit transmutation patterns, StringHash tends toward much better spreads of input bits to output bits.

The theory is confirmed in the exhaustive collision search tests, where StringHash remains collision free for the small data sets used, while the Java hash produces more collisions (double, tripe, quadruple, etc) than it does collision-free results.

Added to this, the Lua implementation of StringHash is slightly faster than the Lua implementation of the Java hash (at least in its most straight-forward implementation).

Applicability to other languages

StringHash was designed to be reasonably collision-resistant and fast in a Lua standard library environment. Other interpreted languages capable of 50-bit integer math (i.e. through using 64-bit floats) should also be able to make a fair use of it.

Languages with access to lower level primitives (XOR, ANDs, rotations) would do better to look at e.g. Bob Jenkins' lookup3 hash, or at least read his page on hash analysis.

Appendix: Lua hash analysis code

mod=math.mod;

-- This is the "Java hash", a variant of Dan Bernstein's old "33" hash posted long ago in the comp.lang.c newsgroup.
-- The *31 basically translates to "new = old<<5 - old", i.e. you never completely lose data from earlier bytes as you would by only shifting
-- The mod is necessary in Lua; otherwise the value turns floating-point and you lose precision after ~10 characters
-- This function would be better with a prime modulus, but 2^32 is chosen as a mod to emulate the 32-bit math used in the original
function JavaHash(text)
  local counter = 0; 
  local len = string.len(text); 
  for i = 1, len do 
    counter = (counter*31) + string.byte(text, i)
    counter = mod(counter, 4294967296);	-- 2^32
  end; 
  return mod(counter, 4294967296);  -- 2^32
end


-- This string hash is ~30% faster due to working with 3 characters at a time; the mod expression is fairly expensive
-- The multiplications are all chosen to affect 3 areas of the accumulator at a time:
--      8161 =     8192 -     32 + 1  : 2<<13,  2<< 5,  2<<0
--  16776193 = 16777216 -   1024 + 1  : 2<<24,  2<<10,  2<<0
--   8372226 =  8388608 -  16384 + 2  : 2<<23,  2<<14,  2<<5
--   3932164 =  4194304 - 262144 + 4  : 2<<22,  2<<18,  2<<10
function StringHash(text)
  local counter = 1;
  local len = string.len(text); 
  for i = 1, len, 3 do 
    counter = mod(counter*8161, 4294967279) +  -- 2^32 - 17: Prime!  (counter*8161 is at most a 48-bit number, which easily fits in the 52-bit mantissa of a double precision float)
  	  (string.byte(text,i)*16776193) +
  	  ((string.byte(text,i+1) or (len-i+256))*8372226) +
  	  ((string.byte(text,i+2) or (len-i+256))*3932164);
  end; 
  return mod(counter, 4294967291); -- 2^32 - 5: Prime (and different from the prime in the loop)
end




-- Utility function: return number of bits changed between a and b
function bitsdiff(a,b)
  local n=0;
  for i=1,32 do
    if(mod(a,2) ~= mod(b,2)) then n=n+1; end
    a=math.floor(a/2);
    b=math.floor(b/2);
  end
  return n;
end


function tcopy(to,from)
  for k,v in pairs(from) do
  	to[k]=v;
  end
end
  

-- Bit transmutation test: Count how many bits change as a result of a single bit of input flipping
function transtest(func,num,  len, first, last)
  local bitsdiffhits = {}
  
  local origstrbytes = { };
  for c=1,len do
  	table.insert(origstrbytes, 0);
  end
  
  math.randomseed(2);

  for rep=1,num do
  	local strbytes = {};
  	for c=first,last do
  		tcopy(strbytes, origstrbytes);
  		local bitval=1;
  		for b=1,8 do
  			local before = func(string.char(unpack(strbytes)));
  			strbytes[c] = origstrbytes[c];
  			strbytes[c] = math.mod(origstrbytes[c] + bitval, 256);
  			local after = func(string.char(unpack(strbytes)));
  			local n = bitsdiff(before,after);
  			bitsdiffhits[n] = (bitsdiffhits[n] or 0) + 1;
  			bitval=bitval*2;
  		end
  	end
  	
  	for c=1,table.getn(origstrbytes) do
  		origstrbytes[c]=math.random(0,255);
  	end
  end
  local score=0;
  for i=0,32 do
  	score = score - (bitsdiffhits[i] or 0)*(math.abs(16-i)/16);	-- 16 = 50% of the bits, which is how many bits we want to change in a perfect distribution
  	print(string.format("  %2u: %5u", i, bitsdiffhits[i] or 0));
  end
  print("  SCORE: "..score);	-- 0 is unimaginatively good. negative is bad.
end


-- Performance test: Just measure the time it takes to execute a hash on a fixed string lots of times (including the cost of the loop, obviously)
function perftest(func, num)
  local x=os.clock();
  for i=1,num do
    func("abcdefghijklmnopqrstuvwxyz");
  end
  return os.clock()-x;
end


-- Count the number of single, double, triple etc collisions given a certain manipulation of the resultant hash:
--   div by X, take modulus Y
-- The modulus is there to create an artificial range clamping; testing would take too long otherwise.
-- Should match expected Poisson distribution closely for an unbiased hash.
function collisiontest(func, num, div,modulus, randfunc)
  local x = os.clock();
  local outputs = {}
  local collisions = {}
  math.randomseed(1);
  for i=1,num do
  	str = randfunc();
  	local n = math.mod(math.floor(func(str)/div), modulus);
  	outputs[n] = (outputs[n] or 0)+1;
  end	
  for _,n in pairs(outputs) do
  	collisions[n] = (collisions[n] or 0) + n;
  end
  local y = os.clock();
  local sortcol = {}
  for k,v in pairs(collisions) do
  	table.insert(sortcol, k);
  end
  table.sort(sortcol);

  -- Compute expected Poisson distribution
  expect = {}
  lambda = num/modulus;
  fly = lambda * math.exp(-lambda);
  for c=1,10 do
  	expect[c] = fly * modulus;
  	fly = fly * lambda / c;
  end
  
  -- Output results and expected results
  for _,k in ipairs(sortcol) do
  	print(string.format("  %2u: %5u  (expected %5u)", k, collisions[k], expect[k] or -1));
  end
  return y-x;
end


-- Bit distribution test: Measure number of times each bit is set, should average out to 50% chance in the end
-- Will be presented as the absolute difference from 50%
function bitdistribtest(func,num,lo,hi,dispmin)
  math.randomseed(1);
  local bitsset={};
  for i=1,num do
  	str = string.char(math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),
  		math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),
  		math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi));
  	local n = func(str);
  	for k=1,32 do
  		if(mod(n,2)==1) then
  			bitsset[k] = (bitsset[k] or 0)+1;
  		end
  		n=math.floor(n/2);
  	end
  end	
  for k=1,32 do
  	local diff = math.abs(50-(bitsset[k] or 0)/num*100);
  	if(diff > dispmin) then
  		print(string.format("  %2u: %.1f%%", k, diff));
  	end
  end
end



-- Collision finder test: Loop all possible values in given set and count collisions (yes, this is slow, and uses quite a bit of ram)
function collisionfinder(func,  lo, hi,  bytes,  prefix, suffix)
  local str;
  local coll={};
  prefix = prefix or "";
  suffix = suffix or "";
  
  if(bytes==2) then
  	for b=lo,hi do
  		for a=lo,hi do
  			str = prefix .. string.char(a,b) .. suffix;
  			local n = func(str);
  			coll[n] = (coll[n] or 0) + 1;
  		end
  	end
  elseif(bytes==3) then
  	for c=lo,hi do
  		for b=lo,hi do
  			for a=lo,hi do
  				str = prefix .. string.char(a,b,c) .. suffix;
  				local n = func(str);
  				coll[n] = (coll[n] or 0) + 1;
  			end
  		end
  	end
  elseif(bytes==4) then
  	for d=lo,hi do
  		for c=lo,hi do
  			for b=lo,hi do
  				for a=lo,hi do
  					str = prefix .. string.char(a,b,c,d) .. suffix;
  					local n = func(str);
  					coll[n] = (coll[n] or 0) + 1;
  				end
  			end
  		end
  	end
  elseif(bytes==5) then
  	for e=lo,hi do
  		for d=lo,hi do
  			for c=lo,hi do
  				for b=lo,hi do
  					for a=lo,hi do
  						str = prefix .. string.char(a,b,c,d,e) .. suffix;
  						local n = func(str);
  						coll[n] = (coll[n] or 0) + 1;
  					end
  				end
  			end
  		end
  	end
  end
  local collcnt = {};
  for k,n in coll do
  	collcnt[n] = (collcnt[n] or 0) + 1;
  end
  
  for colls, count in pairs(collcnt) do
  	if(colls==1) then
  		print(string.format("  %5u non-colliding values", count));
  	else
  		print(string.format("  %5u instances of %u inputs producing the same hash (%u strings)", count, colls, count*colls));
  	end
  end
end






---------------------------------------------------------------------------
-- Run the tests


if(true) then
  
  local colls;

  
  local function longrand(lo,hi)
  	return function() 
  		local ret = "";
  		while colls[ret] do
  			ret = "aaaaaaaaa"..string.char(math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),
  				math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),
  				math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi));
  		end
  		colls[ret] = true;
  		return ret;
  	end
  end
  
  print "";


  
  print "";
  print "Collision test - long random string 32-127 (20000 values into 32768 buckets)";
  print "------------------------------------------"

  print("JavaHash:");
  colls = {};
  print(collisiontest(JavaHash, 20000, 1,32768, longrand(32,127)));
  
  print("StringHash:");
  colls = {};
  print(collisiontest(StringHash, 20000, 1,32768, longrand(32,127)));



  print "";
  print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 1-15)";
  print "------------------------------------------"

  print("JavaHash:");
  colls = {};
  print(collisiontest(JavaHash, 20000, 1,32768, longrand(0,255)));
  
  print("StringHash:");
  colls = {};
  print(collisiontest(StringHash, 20000, 1,32768, longrand(0,255)));



  print "";
  print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 6-20)";
  print "------------------------------------------"

  print("JavaHash:");
  colls = {};
  print(collisiontest(JavaHash, 20000, math.pow(2,5),32768, longrand(0,255)));
  
  print("StringHash:");
  colls = {};
  print(collisiontest(StringHash, 20000, math.pow(2,5),32768, longrand(0,255)));



  print "";
  print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 12-26)";
  print "------------------------------------------"

  print("JavaHash:");
  colls = {};
  print(collisiontest(JavaHash, 20000, math.pow(2,11),32768, longrand(0,255)));
  
  print("StringHash:");
  colls = {};
  print(collisiontest(StringHash, 20000, math.pow(2,11),32768, longrand(0,255)));



  print "";
  print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 18-32)";
  print "------------------------------------------"

  print("JavaHash:");
  colls = {};
  print(collisiontest(JavaHash, 20000, math.pow(2,17),32768, longrand(0,255)));
  
  print("StringHash:");
  colls = {};
  print(collisiontest(StringHash, 20000, math.pow(2,17),32768, longrand(0,255)));


end


if(true) then
  print "";
  print "";
  print "Performance test";
  print "-----------------"
  
  print("JavaHash  speed: ", perftest(JavaHash, 20000));
  print("StringHash speed: ", perftest(StringHash, 20000));
end


if(true) then
  print "";
  print "";
  print "Bit transmutation test (how many bits change if a single bit changes?)";
  print "----------------------";
  
  print("");
  print("JavaHash 13 bytes total, change 1-13:");
  transtest(JavaHash,50, 13, 1,13);

  print("");
  print("StringHash 13 bytes total, change 1-13:");
  transtest(StringHash,50, 13, 1,13);
  
  
  print("");
  print("JavaHash 13 bytes total, change 1-2:");
  transtest(JavaHash,50, 13, 1,2);

  print("");
  print("StringHash 13 bytes total, change 1-2:");
  transtest(StringHash,50, 13, 1,2);

  
  print("");
  print("JavaHash 13 bytes total, change 12-13:");
  transtest(JavaHash,50, 13, 12,13);

  print("");
  print("StringHash 13 bytes total, change 12-13:");
  transtest(StringHash,50, 13, 12,13);
  
  
end


if(true) then
  print "";
  print "";
  print "Bit distribution test (what's the chance of each bit being set? expressed as abs diff from 50%)";
  print "---------------------";
  
  print("");
  print("JavaHash 0-255, 10000 loops: (showing diffs>0.4%)");
  bitdistribtest(JavaHash,10000,0,255, 0.4);

  print("");
  print("StringHash 0-255, 10000 loops: (showing diffs>0.4%)");
  bitdistribtest(StringHash,10000,0,255, 0.4);

  print("");
  print("JavaHash 32-127, 10000 loops: (showing diffs>0.4%)");
  bitdistribtest(JavaHash,10000,32,127, 0.4);

  print("");
  print("StringHash 32-127, 10000 loops: (showing diffs>0.4%)");
  bitdistribtest(StringHash,10000,32,127, 0.4);

end


if(true) then
  print "";
  print "";
  print "Collision finder (run through all possible values of a set and count collisions)";
  print "----------------";

  
  print("");
  print("JavaHash, 64-89 x 3 bytes");
  collisionfinder(JavaHash, 64, 89, 3);

  print("");
  print("StringHash, 64-89 x 3 bytes");
  collisionfinder(StringHash, 64, 89, 3);


  print("");

  
  print("");
  print("JavaHash, 64-89 x 4 bytes");
  collisionfinder(JavaHash, 64, 89, 4);

  print("");
  print("StringHash, 64-89 x 4 bytes");
  collisionfinder(StringHash, 64, 89, 4);


  print("");


  print("");
  print("JavaHash, 32-127 x 2 bytes");
  collisionfinder(JavaHash, 32, 127, 2);

  print("");
  print("StringHash, 32-127 x 2 bytes");
  collisionfinder(StringHash, 32, 127, 2);


  print("");


  print("");
  print("JavaHash, 32-127 x 3 bytes");
  collisionfinder(JavaHash, 32, 127, 3);

  print("");
  print("StringHash, 32-127 x 3 bytes");
  collisionfinder(StringHash, 32, 127, 3);


  print("");

  
  print("");
  print("JavaHash, 32-127 x 3 bytes, prefix '01234567890123456789'");
  collisionfinder(JavaHash, 32, 127, 3, "01234567890123456789");

  print("");
  print("StringHash, 32-127 x 3 bytes, prefix '01234567890123456789'");
  collisionfinder(StringHash, 32, 127, 3, "01234567890123456789");


  print("");


  print("");
  print("JavaHash, 32-127 x 3 bytes, suffix '01234567890123456789'");
  collisionfinder(JavaHash, 32, 127, 3, "", "01234567890123456789");

  print("");
  print("StringHash, 32-127 x 3 bytes, suffix '01234567890123456789'");
  collisionfinder(StringHash, 32, 127, 3, "", "01234567890123456789");



  print("");		-- This test takes a looooong time
  if(true) then
  	print("");
  	print("JavaHash, 64-82 x 5 bytes");
  	collisionfinder(JavaHash, 64, 82, 5);
  
  	print("");
  	print("StringHash, 64-82 x 5 bytes");
  	collisionfinder(StringHash, 64, 82, 5);
  end
  
end

Navigation menu

Search

StringHash/Analysis

Reference code

Java hash

StringHash

Tests performed and result analysis

Performance test

Result analysis

Bit distribution test

Inputs 0-255

Inputs 32-127

Result analysis

Collision frequency test

Long random string 32-127 bits 1-15

Long random string 0-255 bits 1-15

Long random string 0-255 bits 6-20

Long random string 0-255 bits 12-26

Long random string 0-255 bits 18-32

Result analysis

Bit transmutation test

Java, flip all bits in bytes 1 through 13

StringHash, flip all bits in bytes 1 through 13

Java, flip all bits in bytes 12 through 13

StringHash, flip all bits in bytes 12 through 13

Result analysis

Collision finding test

ASCII codes 64-89 (upper case A-Z), 3 bytes

ASCII codes 64-89 (upper case A-Z), 4 bytes

ASCII codes 32-127 (all printable), 2 bytes

ASCII codes 32-127 (all printable), 3 bytes

ASCII codes 32-127 (all printable), 3 bytes, with prefix

ASCII codes 32-127 (all printable), 3 bytes, with suffix

ASCII codes 64-82 (A-S), 5 bytes

Result analysis

Result summary

Applicability to other languages

Appendix: Lua hash analysis code