Saturday, October 15, 2011

Space Restricted String Hashing

Source: CLRS, Introduction to Algorithms, 2nd Edition, Page 236, Problem 11.3-2

Problem:
Suppose that a string of r characters is hashed into m slots by treating it as a radix-128 number and then using the division method. The number m is easily represented as a 32-bit computer word, but the string of r characters, treated as a radix-128 number, takes many words. How can we apply the division method to compute the hash value of the character string without using more than a constant number of words of storage outside the string itself?

Basics:
This problem requires some basic number theory principles, and more specifically modulo arithmetic. The following 2 will be applied to solve this problem. (The second one holds for integers only)
\begin{equation} (a+b)\%m=(a\%m+b\%m)\%m \end{equation} and \begin{equation} (a \cdot b )\%m=((a\%m) \cdot (b\%m))\%m \end{equation}

Solution:
Based on above analysis, the string of r characters can be hashed using division method with constant space. Any string of length r can be converted to a radix-128 number, say, \[n=c_{0} \cdot 128^{0}+c_{1} \cdot 128^{1}+c_{2} \cdot 128^{2} + \ldots + c_{r-1} \cdot 128^{r-1},\] where \(c_{i}\) means the \(i\)th character starting from the least significant end with 0 (which can also be viewed as an integer of its ASCII code).

Hence, apply the division hashing function, \begin{eqnarray} n\%m &=& (c_{0} \cdot 128^{0}+c_{1} \cdot 128^{1} + \ldots + c_{r-1} \cdot 128^{r-1})\%m \\ &=& ((c_{0} \cdot 128^{0})\%m+(c_{1} \cdot 128^{1})\%m + \ldots + (c_{r-1} \cdot 128^{r-1})\%m)\%m \\ \end{eqnarray} However, it looks that we need first compute some large numbers, for example, \(128^{r-1}\) of the last term. If so, we still need many words to store them. But based on the second rule mentioned above, we don't have to store such large numbers. All we need to do is take the modulo operation after each multiplication. In this way, only constant number of words for storage is needed.

Here is the code.


However, the above code involves nested loops, which is unnecessary. Based on Horner's rule, we can rewrite \(n\) as \[n=c_0+128(c_1+128(c_2+...+128(c_{r-2}+128\cdot c_{r-1}) \ldots))\]
This is also the basis for Rabin-Karp algorithm. Now, the code can be improved as below.

3 comments:

  1. I think there is one mistake in the last program, the loop should run in a reverse order:
    for (int i = r - 1; i >= 0; i--)

    ReplyDelete
    Replies
    1. Thanks for the comments. But I think the original order was correct. Note that I was looping from the most significant bit to least, where \(s[i] = c_{r-1-i}\). So \(s[0] = c_{r-1}, s[1] = c_{r-2}, ..., s[r-1] = c_{0}\).

      Delete
  2. This comment has been removed by the author.

    ReplyDelete