CS174 Spring 99, Lecture 13 Summary

Testing equality of Strings

Testing equality of strings has a number of applications in networked computing environments. A common problem in a distributed environment is that multiple copies of documents exist in different places (e.g. in the cache of your web browser, on your PC’s local disk, etc.). Any one of these copies may be modified, and then the others will need to be updated. But first you must find out whether the two documents are still the same or not. Time stamps are a partial solution, but a much more general and reliable technique is to use fingerprinting.

A typical situation is shown above. We have two documents in different parts of the network, and we would like to check if they are the same without sending an entire copy of one of them. Instead you can use a fingerprint. You have already seen one way to do this.

Polynomial Method 1

Let the two strings consist of n characters each (or there isn’t much point in checking equality). Let one string be a = a₁a₂…a_n, and the other b = b₁b₂…b_n. You can think of a string as a polynomial of degree n whose coefficients are its characters, for example

a(x) = a₁x + a₂x² + … + a_n xⁿ

And similarly for b(x). Now you can check if a(x)=b(x) by choosing a random r and computing x = a(r) and y = b(r). You only need to transmit x or y across the network, and with a little more work we can make sure that they are much smaller than the strings a or b, which have n bytes.

As we have already seen, if r is chosen uniformly at random from {0,…,M-1} and if a(x) and b(x) are different, then the probability that x = y is at most n/M (n is the degree of the polynomials here). So by choosing M > 2n, we have a better than even chance of detecting a difference. And we can make the probability of not discovering a difference between a and b exponentially small by using more bits for r: each extra bit doubles M and halves the bound n/M. But since M need not be much larger than 2n, we need only O(log n) bits for r.
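As a worked instance of the trade-off between the size of r and the error bound (the choice M = 2^k n is an illustrative assumption, not a value fixed in these notes):

```latex
M = 2^{k} n
\;\Longrightarrow\;
\Pr[\,x = y \mid a \neq b\,] \;\le\; \frac{n}{M} = 2^{-k},
\qquad
\text{bits for } r \;=\; \lceil \log_2 M \rceil \;=\; \lceil \log_2 n + k \rceil .
```

So k extra bits on top of log₂ n push the failure bound down to 2^(-k).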

But then we run into a snag. Even though r itself has only O(log n) bits, the polynomials a(x) and b(x) include a term of the form xⁿ, and when we compute the nth power of r, we will get a number with O(n log n) bits. That’s actually longer than the original strings.
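The size blow-up is easy to see numerically. The following is a small sketch with a hypothetical string length n = 1000 and the largest r allowed by M = 2n + 1; both values are assumptions chosen only for illustration.

```python
n = 1000                 # hypothetical string length in characters
M = 2 * n + 1            # range for r, chosen so that M > 2n
r = M - 1                # largest possible r, the worst case for size

power = r ** n           # r^n computed exactly over the integers

bits_r = r.bit_length()          # O(log n) bits
bits_power = power.bit_length()  # roughly n * log2(r) bits
bits_string = 8 * n              # size of the original string in bits

print(bits_r, bits_power, bits_string)
```

The single term rⁿ already needs more bits than the 8n bits of the string it is supposed to summarize.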

The solution is to work over a finite field (modulo a prime) instead of over the integers. That is, we do all calculations mod p, which guarantees that we never need more than O(log p) bits. To figure out how large p needs to be, notice that there are only p distinct possible values for r (mod p). So to get enough distinct choices for r to make the random choice part work, we need p > M.

So to summarize: n = length of a or b

Let M be "somewhat larger" than n, say 2n or more
Pick a prime p > M
Choose a random r from {0,…,M-1}
Compute x = a(r)(mod p) and transmit x, p, r across the network (all O(log n) bits)
At other end of network, receive x, p, r and compare x with y = b(r)(mod p)

This will detect a difference between a and b with probability at least 1-n/M. Note that since p doesn’t need to change, it does not need to be transmitted if the two parties agree on it ahead of time.
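The steps of Method 1 can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the polynomial a(x) = a₁x + a₂x² + … + a_n xⁿ (the coefficient order is an assumption; any fixed order works), and the helper names `poly_mod`, `make_fingerprint` and `matches` are hypothetical. The sketch also takes p above 255 so that distinct bytes stay distinct mod p, which is an extra precaution of this sketch rather than something stated in the notes.

```python
import random

def is_prime(m):
    """Trial division; adequate for the small moduli used here."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def next_prime_above(m):
    """Smallest prime strictly greater than m."""
    c = m + 1
    while not is_prime(c):
        c += 1
    return c

def poly_mod(s, r, p):
    """Evaluate s[0]*r + s[1]*r^2 + ... + s[n-1]*r^n (mod p) by Horner's rule."""
    acc = 0
    for byte in reversed(s):
        acc = (acc + byte) * r % p   # every intermediate value stays below p
    return acc

def make_fingerprint(a, M, p, rng=random):
    """Sender: pick a random r from {0,...,M-1} and return (x, r)."""
    r = rng.randrange(M)
    return poly_mod(a, r, p), r

def matches(b, x, r, p):
    """Receiver: compare the received x with y = b(r) (mod p)."""
    return poly_mod(b, r, p) == x

a = b"the quick brown fox"
b = b"the quick brown fax"
n = len(a)
M = 2 * n                              # "somewhat larger" than n
p = next_prime_above(max(M, 255))      # assumption: also keep p above the byte range
x, r = make_fingerprint(a, M, p)
print(matches(a, x, r, p))             # identical copies always match
print(matches(b, x, r, p))             # a differing copy is usually caught
```

Horner's rule uses n multiplications and n additions, each reduced mod p, which is where the O(n) arithmetic steps come from.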

The running time depends on the time to compute a(r)(mod p). If we assume each arithmetic step has unit cost, that takes only O(n) time. But that is a little bit loose because each arithmetic step involves O(log n) bits. However, it’s rare to get strings with more than 2³⁰ characters, and 32 bits is enough to hold the results for smaller strings, so it’s not unreasonable to make that assumption in practice.

Polynomial Method 2

This method is very similar to the first method. The difference is that instead of using a fixed p and a randomly chosen r, we use a fixed r and a randomly chosen prime p. First of all, we want to find an r such that a(r) and b(r) are always different if a and b are different strings. Assuming each string contains 8-bit bytes, it suffices to take r = 256 = 2⁸.

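Now we compute x = a(256) (mod p) and transmit it. The recipient compares it to y = b(256) (mod p). Those two will be the same if a(256) = b(256) (mod p), that is, if the difference of the unreduced values is divisible by p. In what follows, x – y stands for this unreduced difference a(256) – b(256).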
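To see why r = 256 separates different strings over the integers, note that a(256) is just the string read as a base-256 number. The sketch below assumes the polynomial a(x) = a₁x + a₂x² + … + a_n xⁿ; the function names `a_of_256` and `digits_base_256` are hypothetical.

```python
def a_of_256(s):
    """a(256) = s[0]*256 + s[1]*256^2 + ... + s[n-1]*256^n as an exact integer."""
    return sum(byte * 256 ** (i + 1) for i, byte in enumerate(s))

def digits_base_256(value, n):
    """Recover the n bytes from a(256) by reading off its base-256 digits."""
    value //= 256                  # drop the empty 256^0 digit
    out = []
    for _ in range(n):
        out.append(value % 256)
        value //= 256
    return bytes(out)

a = b"cache"
b = b"cachf"
xa, xb = a_of_256(a), a_of_256(b)
recovered = digits_base_256(xa, len(a))
print(recovered == a, xa != xb)
```

Since the bytes can be recovered from a(256), two different n-byte strings can never give the same value before reduction mod p.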

Now if we choose p from a "large enough set" then the probability that (x-y) is divisible by p will be low. To specify how large a set, we need to know a little more about prime numbers:

Prime Number Theorem

Let π(k) = number of distinct prime numbers less than k. Then

π(k) ≈ k / ln k

where the approximation symbol means "asymptotically approaches", i.e. the ratio of the two sides converges to 1 as k → ∞.
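A quick numerical check of the theorem, as a sketch: the sieve below counts the primes less than k and compares the count with k / ln k. The function name `prime_sieve` and the sample values of k are choices made for this illustration.

```python
import math

def prime_sieve(limit):
    """Return all primes less than limit (Sieve of Eratosthenes), for limit >= 2."""
    is_p = [True] * limit
    is_p[0] = is_p[1] = False
    for i in range(2, math.isqrt(limit) + 1):
        if is_p[i]:
            for j in range(i * i, limit, i):
                is_p[j] = False
    return [i for i, flag in enumerate(is_p) if flag]

counts = {}
ratios = []
for k in (10**3, 10**4, 10**5, 10**6):
    counts[k] = len(prime_sieve(k))        # pi(k)
    estimate = k / math.log(k)             # k / ln k
    ratios.append(counts[k] / estimate)
    print(k, counts[k], round(estimate), round(ratios[-1], 3))
```

The ratio π(k) / (k / ln k) drifts toward 1 as k grows, though slowly.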

Corollary:

Let N be the product of the first m distinct primes, then m < 2 ln N / ln ln N.

Proof: Let k be the median prime factor of N. Half the factors are bigger than k, so

N > k^(m/2) or (m/2) ln k < (ln N)

Now half the factors are smaller than k, so

m/2 = π(k) ≈ k / ln k

and substituting for m/2 in the last inequality

(k/ln k)ln k < ln N or k < ln N

and then substituting for k in the formula for m/2 gives:

m < 2 ln N/ ln ln N QED

Note: the inequality m < 2 ln N / ln ln N holds for any product N of m distinct primes, because those primes, and hence their product, are at least as large as the first m primes and their product.
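The corollary can be checked numerically for products of the first m primes. This is a sketch only; the helper names are hypothetical, and the check starts at m = 2 because for m = 1 we have N = 2 and ln ln N is negative.

```python
import math

def first_primes(m):
    """First m primes, testing each candidate against the primes found so far."""
    primes = []
    c = 2
    while len(primes) < m:
        if all(c % q for q in primes if q * q <= c):
            primes.append(c)
        c += 1
    return primes

def corollary_bound(N):
    """Right-hand side 2 ln N / ln ln N (needs N >= 3 so that ln ln N > 0)."""
    ln_n = math.log(N)
    return 2 * ln_n / math.log(ln_n)

rows = []
N = 1
for m, q in enumerate(first_primes(50), start=1):
    N *= q                       # N is now the product of the first m primes
    if m >= 2:
        rows.append((m, corollary_bound(N)))

holds = all(m < bound for m, bound in rows)
print(holds)
```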

Going back to the string comparison, if x – y = N, then N has at most 8n bits for n-character string comparisons, so ln N ≤ 8n ln 2. The maximum number of distinct prime factors of N is 2 ln N / ln ln N, or O(n/log n).

When we choose a prime p, we will fail if p divides x – y. By the above argument, there are at most O(n/log n) bad choices for p. So we should pick p in a range {0,…,M} which contains substantially more than n/log n primes, i.e. we want

π(M) ≈ M / ln M ≫ n/log n

and choosing M = Ω(n), with a large enough constant factor, will do this.
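To get a feel for the constants, the two estimates can be put side by side. This is a rough sketch under stated assumptions: it uses the corollary's bound 2 ln N / ln ln N with ln N ≤ 8n ln 2 for the number of bad primes, and M / ln M as a stand-in for π(M). The function names and the multiples of n tried are hypothetical. Because the first quantity is only an upper bound, a result of 1.0 means the bound is inconclusive for that M, not that the method fails.

```python
import math

def max_bad_primes(n):
    """Upper bound on primes dividing N when N has at most 8n bits."""
    ln_N = 8 * n * math.log(2)          # ln N <= 8n ln 2
    return 2 * ln_N / math.log(ln_N)

def approx_prime_count(M):
    """Prime Number Theorem estimate of the number of primes up to M."""
    return M / math.log(M)

def failure_bound(n, M):
    """Estimated upper bound on the chance of picking a bad prime."""
    return min(1.0, max_bad_primes(n) / approx_prime_count(M))

n = 1000
for c in (4, 16, 64, 256):
    print(c, round(failure_bound(n, c * n), 3))
```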

So to summarize this method: n = length of a or b

Let M be "somewhat larger" than n, say 4n or more
Pick a random prime p from {0,…,M}
Compute x = a(256)(mod p) and transmit x, p across the network (both O(log n) bits)
At other end of network, receive x, p and compare x with y = b(256)(mod p)
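The steps of Method 2 can be sketched as follows. As before this is an illustration under assumptions: the polynomial is taken to be a(x) = a₁x + a₂x² + … + a_n xⁿ, the function names are hypothetical, and the multiple 64 used for M is an arbitrary choice for the sketch rather than a value from the notes.

```python
import math
import random

def primes_up_to(M):
    """All primes in {0,...,M} by the Sieve of Eratosthenes."""
    if M < 2:
        return []
    is_p = [True] * (M + 1)
    is_p[0] = is_p[1] = False
    for i in range(2, math.isqrt(M) + 1):
        if is_p[i]:
            for j in range(i * i, M + 1, i):
                is_p[j] = False
    return [i for i, flag in enumerate(is_p) if flag]

def fingerprint_mod_p(s, p, r=256):
    """Evaluate s[0]*r + s[1]*r^2 + ... + s[n-1]*r^n (mod p) by Horner's rule."""
    acc = 0
    for byte in reversed(s):
        acc = (acc + byte) * r % p
    return acc

def make_fingerprint(a, primes, rng=random):
    """Sender: pick a random prime p from the table and return (x, p)."""
    p = rng.choice(primes)
    return fingerprint_mod_p(a, p), p

def matches(b, x, p):
    """Receiver: compare the received x with y = b(256) (mod p)."""
    return fingerprint_mod_p(b, p) == x

a = b"the quick brown fox"
b = b"the quick brown fax"
n = len(a)
M = 64 * n                       # assumption: an arbitrary multiple of n
primes = primes_up_to(M)         # can be computed once and saved
x, p = make_fingerprint(a, primes)
print(matches(a, x, p))          # identical copies always match
print(matches(b, x, p))          # a differing copy is usually caught
```

With the polynomial form assumed here every term contains a factor of 256, so p = 2 always gives the fingerprint 0; it simply counts as one of the bad primes.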

To contrast the two methods: both compare a(r) (mod p) and b(r) (mod p). Method 1 uses a fixed p and a random r with O(log n) bits each. Method 2 uses a fixed r = 256 and a random p with O(log n) bits. Computing the primes for method 2 is expensive, so they could be computed and saved before the algorithm begins. For n-character strings, you would need O(n/log n) primes.