Testing equality of strings has a number of applications in networked computing environments. A common problem in a distributed environment is that multiple copies of documents exist in different places (e.g. in the cache of your web browser, on your PC's local disk, etc.). Any one of these copies may be modified, and then the others will need to be updated. But first you must find out whether two copies are still the same or not. Time stamps are a partial solution, but a much more general and reliable technique is to use fingerprinting.

A typical situation is shown above. We have two documents in different parts of the network, and we would like to check if they are the same without sending an entire copy of one of them. Instead you can use a fingerprint. You have already seen one way to do this.
Let the two strings consist of n characters each (otherwise there isn't much point in checking equality). Let one string be $a = a_1 a_2 \cdots a_n$, and the other $b = b_1 b_2 \cdots b_n$. You can think of a string as a polynomial,
$$a(x) = a_1 x + a_2 x^2 + \cdots + a_n x^n$$
And similarly for b(x). Now you can check if a(x)=b(x) by choosing a random r and computing x = a(r) and y = b(r). You only need to transmit x or y across the network, and with a little more work we can make sure that they are much smaller than the strings a or b, which have n bytes.
As we have already seen, if r is chosen in the range $\{0, \ldots, M-1\}$ and if a(x) and b(x) are different, then the probability that x = y is at most n/M (n is the degree of the polynomials here). So by choosing M > 2n, we can be confident that we have a reasonable chance (better than even) of detecting a difference. And we can make the probability of not discovering a difference between a and b exponentially small by using more bits for r. But since r isn't much larger than 2n, we need only O(log n) bits for r.
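The n/M bound reflects the fact that a nonzero polynomial of degree n has at most n roots, so at most n of the M choices for r can hide a difference. The following is a minimal Python sketch that counts the bad choices by brute force. The convention a(x) = a_1 x + ... + a_n x^n and the helper names `poly_eval` and `count_collisions` are assumptions made for illustration.

```python
def poly_eval(s: bytes, r: int) -> int:
    """Evaluate a(r) = a_1*r + a_2*r^2 + ... + a_n*r^n over the integers.

    The indexing convention is an assumption of this sketch.
    """
    acc = 0
    for byte in reversed(s):      # Horner's rule, highest power first
        acc = (acc + byte) * r
    return acc


def count_collisions(a: bytes, b: bytes, M: int) -> int:
    """Count the r in {0, ..., M-1} for which a(r) == b(r)."""
    return sum(1 for r in range(M) if poly_eval(a, r) == poly_eval(b, r))


# Two different 2-character strings: the difference is 2x - x^2 = x(2 - x),
# so exactly r = 0 and r = 2 collide.
print(count_collisions(bytes([2, 0]), bytes([0, 1]), 5))  # prints 2
```

Under this convention r = 0 always collides, since every string evaluates to 0 there, so it is one of the at most n bad choices.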
But then we run into a snag. Even though r itself has only O(log n) bits, the polynomials a(x) and b(x) include a term of the form $x^n$, and when we compute the nth power of r, we will get a number with O(n log n) bits. That's actually longer than the original strings.
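The growth is easy to observe with Python's arbitrary-precision integers. This is a small sketch, assuming the same a(x) = a_1 x + ... + a_n x^n convention and an illustrative choice of r = 2n + 1; neither the helper name nor the test string comes from the text.

```python
def poly_eval(s: bytes, r: int) -> int:
    """Evaluate a(r) over the integers (no modular reduction)."""
    acc = 0
    for byte in reversed(s):
        acc = (acc + byte) * r
    return acc


n = 1000
s = bytes([255]) * n              # an arbitrary n-byte test string
r = 2 * n + 1                     # r just above 2n, about log2(n) + 1 bits

value_bits = poly_eval(s, r).bit_length()   # size of the integer a(r)
string_bits = 8 * n                         # size of the string itself
print(value_bits, string_bits)
```

For this input the unreduced value a(r) needs more bits than the string it is supposed to summarize.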
The solution is to work over a finite field (modulo a prime) instead of over the integers. That is, we do all calculations mod p, which guarantees that we never need more than O(log p) bits. To figure out how large p needs to be, notice that there are only p distinct possible values for r (mod p). So to get enough distinct choices for r to make the random choice part work, we need p > M.
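So to summarize: n = length of a or b; M = a bound larger than 2n; p = a fixed prime larger than M; r = a random number in $\{0, \ldots, M-1\}$. Comparing x = a(r) (mod p) with y = b(r) (mod p) will detect a difference between a and b with probability at least 1 - n/M. Note that since p doesn't need to change, it does not need to be transmitted if the two parties agree on it ahead of time.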
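The first method can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the prime `P`, the function names, the a(x) = a_1 x + ... + a_n x^n convention, and the choice to send r along with x are all assumptions of the sketch.

```python
import secrets

# A fixed prime agreed on ahead of time (assumption: 2^31 - 1, a Mersenne prime).
P = 2_147_483_647


def fingerprint(s: bytes, r: int, p: int) -> int:
    """Compute a(r) mod p by Horner's rule, reducing at every step
    so intermediate values never exceed about 2 * log2(p) bits."""
    acc = 0
    for byte in reversed(s):
        acc = (acc + byte) * r % p
    return acc


def sender(a: bytes, M: int):
    """Pick a random r in {0, ..., M-1} and return (r, x). Requires M < P."""
    r = secrets.randbelow(M)
    return r, fingerprint(a, r, P)


def receiver(b: bytes, r: int, x: int) -> bool:
    """Return True if the local copy b has the same fingerprint."""
    return fingerprint(b, r, P) == x


r, x = sender(b"the quick brown fox", 1 << 20)
print(receiver(b"the quick brown fox", r, x))   # identical copies always match
```

Identical strings always agree; different strings disagree except for the at most n unlucky values of r.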
The running time depends on the time to compute a(r) (mod p). If we assume each arithmetic step has unit cost, that takes only O(n) time. But that is a little bit loose because each arithmetic step involves O(log n) bits. However, it's rare to get strings with more than $2^{30}$ characters, and 32 bits is enough to hold the results for smaller strings, so it's not unreasonable to make that assumption in practice.
The second method is very similar to the first. Only instead of using a fixed p and a randomly chosen r, we use a fixed r and a randomly chosen prime p. First of all, we want to find an r such that a(r) and b(r) are always different if a and b are different strings. Assuming each string contains 8-bit bytes, it suffices to take r = 256 = $2^8$.
Now we compute x = a(256) (mod p) and transmit it. The recipient compares it to y = b(256) (mod p). Those two will be the same if x = y (mod p), that is, if (x - y) is divisible by p.
Now if we choose p from a "large enough set" then the probability that (x-y) is divisible by p will be low. To specify how large a set, we need to know a little more about prime numbers:
Let $\pi(k)$ = number of distinct prime numbers less than k. Then
$$\pi(k) \approx k / \ln k$$
where the approximation symbol means "asymptotically approaches", i.e. the ratio of the two sides converges to 1 as $k \to \infty$.
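A quick numerical check of this estimate, as a Python sketch: `prime_count` is a hypothetical helper that counts primes below k with a sieve of Eratosthenes, and the sample values of k are arbitrary.

```python
import math


def prime_count(k: int) -> int:
    """Number of primes strictly less than k (sieve of Eratosthenes)."""
    if k < 3:
        return 0
    sieve = bytearray([1]) * k        # sieve[i] == 1 means "i may be prime"
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(k - 1) + 1):
        if sieve[i]:
            # Cross out every multiple of i starting at i*i.
            sieve[i * i::i] = bytearray(len(range(i * i, k, i)))
    return sum(sieve)


for k in (10**3, 10**4, 10**5, 10**6):
    estimate = k / math.log(k)
    print(k, prime_count(k), round(estimate), prime_count(k) / estimate)
```

The ratio in the last column approaches 1 only slowly, so for moderate k the estimate is good to within a small constant factor rather than exact.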
Corollary:
Let N be the product of the first m distinct primes, then $m < 2 \ln N / \ln \ln N$.
Proof: Let k be the median prime factor of N. Half the factors are bigger than k, so
$N > k^{m/2}$, or $(m/2) \ln k < \ln N$
Now half the factors are smaller than k, so
$m/2 = \pi(k) \approx k / \ln k$
and substituting for m/2 in the last inequality
$(k / \ln k) \ln k < \ln N$, or $k < \ln N$
and then substituting for k in the formula for m/2 gives:
$m < 2 \ln N / \ln \ln N$. QED
Note: the inequality m < 2 ln N / ln ln N holds for any product of m distinct primes, because those primes, and hence their product, are at least as large as the first m primes and their product.
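The corollary can be checked numerically for small m. This is a Python sketch with hypothetical helper names; it starts at m = 2 because for m = 1 we have N = 2 and ln ln N is negative, so the bound is only meaningful once ln N exceeds 1.

```python
import math


def first_primes(m: int) -> list:
    """Return the first m primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < m:
        # candidate is prime if no smaller prime up to its square root divides it
        if all(candidate % q for q in primes if q * q <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def corollary_bound(N: int) -> float:
    """The right-hand side 2 ln N / ln ln N (requires ln N > 1)."""
    return 2 * math.log(N) / math.log(math.log(N))


for m in (2, 5, 10, 20, 50):
    N = math.prod(first_primes(m))    # product of the first m primes
    print(m, corollary_bound(N))
```

In each row the bound printed on the right stays above m.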
Going back to the string comparison, if x - y = N, then N has at most 8n bits for n-character string comparisons. The maximum number of distinct prime factors of N is 2 ln N / ln ln N, or O(n/log n).
When we choose a prime p, we will fail if p divides x - y. By the above argument, there are at most O(n/log n) bad choices for p. So we should pick p in a range $\{0, \ldots, M\}$ which contains substantially more than n/log n primes, i.e. we want
$$\pi(M) \approx M / \ln M \gg n / \log n$$
and choosing $M = \Omega(n)$ will do this.
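To get a feel for the constants, the two estimates can be combined into a rough failure bound. This is a Python sketch; the function names and the sample sizes (a one-megabyte string and M = 1000n) are assumptions chosen for illustration.

```python
import math


def max_bad_primes(n: int) -> float:
    """Upper bound 2 ln N / ln ln N on the bad primes, using N < 2^(8n)."""
    ln_N = 8 * n * math.log(2)
    return 2 * ln_N / math.log(ln_N)


def approx_prime_count(M: int) -> float:
    """The estimate pi(M) ~ M / ln M for the number of primes up to M."""
    return M / math.log(M)


def failure_bound(n: int, M: int) -> float:
    """Rough bound on the chance that a random prime below M is a bad one."""
    return max_bad_primes(n) / approx_prime_count(M)


n = 10**6                  # one-megabyte strings (illustrative)
print(failure_bound(n, 1000 * n))
print(failure_bound(n, 10000 * n))
```

Making the constant in M = Ω(n) larger shrinks the bound roughly in proportion, at the cost of a few extra bits in p.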
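So to summarize this method: n = length of a or b; r = 256, fixed; M = $\Omega(n)$, large enough that $\{0, \ldots, M\}$ contains substantially more than n/log n primes; p = a random prime in $\{0, \ldots, M\}$. The fingerprints compared are x = a(256) (mod p) and y = b(256) (mod p).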
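The second method can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names, the trial-division primality test, the rejection sampling of p, and the a(x) = a_1 x + ... + a_n x^n convention are not taken from the text.

```python
import random


def is_prime(k: int) -> bool:
    """Trial division; adequate for the small primes used here."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def random_prime(M: int, rng: random.Random) -> int:
    """Pick a uniformly random prime in {3, ..., M} by rejection sampling.

    p = 2 is skipped because 256 is even, so every fingerprint would be 0.
    """
    while True:
        candidate = rng.randint(3, M)
        if is_prime(candidate):
            return candidate


def fingerprint_256(s: bytes, p: int) -> int:
    """Compute a(256) mod p by Horner's rule."""
    acc = 0
    for byte in reversed(s):
        acc = (acc + byte) * 256 % p
    return acc


rng = random.Random()
p = random_prime(1_000_000, rng)          # sender picks p and sends it with x
x = fingerprint_256(b"copy of the document", p)
y = fingerprint_256(b"copy of the document", p)
print(x == y)                             # identical copies always match
```

Sampling p on demand, as here, trades the storage of a precomputed table of primes for repeated primality tests.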
So to contrast the two methods: both compare a(r) (mod p) and b(r) (mod p). Method 1 uses a fixed p and a random r with O(log n) bits each. Method 2 uses a fixed r = 256 and a random p with O(log n) bits. Computing the primes for method 2 is expensive, so they could be computed and saved before the algorithm begins. For n-character strings, you would need O(n/log n) primes.