Robert Krauthgamer on embedding edit distance in other metric spaces

Theory Lunch, September 28, 2005

Minutes by Kamalika Chaudhuri

Robi Krauthgamer talked about his work on lower bounds for embedding edit distance into l1.

Given two strings x and y, the edit distance ed(x,y) between them is the minimum number of character insertions, deletions and substitutions required to convert x into y. Such a distance measure is commonly used to compare biological sequences. Given the set of strings V = {0,1}^n, we would like to find an embedding of V into the l1 metric. Such an embedding is said to have distortion at most D if the distance between every pair of points is preserved within a factor of D. Robi's result says that any embedding of edit distance on V into l1 incurs a distortion of Omega(log n).
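As a side note, ed(x,y) is computed by the standard dynamic program over prefixes; a minimal sketch in Python (the function name is my own choice for illustration):

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to convert x into y."""
    m, n = len(x), len(y)
    # dp[i][j] = edit distance between x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete x[i-1]
                           dp[i][j - 1] + 1,          # insert y[j-1]
                           dp[i - 1][j - 1] + cost)   # substitute/match
    return dp[m][n]
```

For example, edit_distance("0101", "1010") is 2: delete the leading 0 and append a 0, which is exactly the "shift" operation that reappears later in the talk.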

Proving Lower Bounds on Distortion

How does one prove lower bounds on the distortion of embedding a metric into l1? Suppose we are given two probability distributions p and q over pairs of strings. We will call them edges and demands respectively. A theorem by [LLR95] says that if edit distance embeds into l1 with distortion D, then there exists a cut (A, V\A) such that

E_p[delta_A(x,y)] / E_q[delta_A(x,y)] <= D * E_p[ed(x,y)] / E_q[ed(x,y)]

Here delta_A(x,y) is 1 if the cut (A, V\A) separates x and y, and 0 otherwise. So if one can show that every cut (A, V\A) has many edges crossing it relative to the demands it separates, then one obtains a lower bound on the distortion of embedding edit distance into l1. The rest of the talk was devoted to proving such a bound.

Choice of Demands and Edge Distributions

The demand distribution q was chosen to be the uniform distribution on V x V. [ADGIR03] shows that the expected edit distance between a uniformly random pair of strings is Theta(n).

The following was selected as the edge distribution. Let p_H be the uniform distribution over edges of the n-dimensional hypercube, and let p_S be the uniform distribution over edges of the form (x, S(x)), where for any string x, S(x) denotes the string obtained by cyclically shifting x by one position. The edge distribution p is

p = (1/2) (p_H + p_S)
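Sampling from p can be sketched concretely (my own illustration; I assume S shifts cyclically by one position to the right, and represent strings as 0/1 tuples):

```python
import random

n = 8  # toy dimension; strings are 0/1 tuples of length n

def shift(x):
    """S(x): cyclic shift by one position (convention assumed here)."""
    return x[-1:] + x[:-1]

def sample_edge(rng=random):
    """Draw a pair (x, y) from p = (1/2)(p_H + p_S): pick x uniformly,
    then with probability 1/2 flip a uniformly random coordinate
    (a uniform hypercube edge), otherwise take y = S(x)."""
    x = tuple(rng.randrange(2) for _ in range(n))
    if rng.random() < 0.5:
        i = rng.randrange(n)                     # uniform dimension
        y = x[:i] + (1 - x[i],) + x[i + 1:]      # flip bit i
    else:
        y = shift(x)
    return x, y
```

Note that either kind of edge joins two strings at edit distance O(1): a flipped bit is one substitution, and ed(x, S(x)) <= 2.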

Plugging in the demand and edge values, one can see that if the distortion is smaller than c log n for a suitable constant c, then there must exist a balanced cut (A, V\A) such that

p(A, V\A) <= O((log n)/n) * q(A, V\A)
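To spell out the substitution: E_q[ed(x,y)] = Theta(n) by [ADGIR03], and each edge of p joins strings at edit distance O(1) (a hypercube edge is a single substitution, and ed(x, S(x)) <= 2), so the [LLR95] inequality becomes

```latex
\frac{p(A, V \setminus A)}{q(A, V \setminus A)}
  \;=\; \frac{\mathbb{E}_p[\delta_A(x,y)]}{\mathbb{E}_q[\delta_A(x,y)]}
  \;\le\; D \cdot \frac{\mathbb{E}_p[\mathrm{ed}(x,y)]}{\mathbb{E}_q[\mathrm{ed}(x,y)]}
  \;=\; D \cdot \frac{O(1)}{\Theta(n)},
```

and a distortion D below c log n forces the stated bound on the cut.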

Proving a Lower Bound on the Expansion of Every Cut

Any cut which separates vertices of an n-dimensional hypercube can be thought of as a boolean function of n variables. Observe that the fraction of hypercube edges of a particular dimension i crossing such a cut is exactly the influence of variable i on the corresponding function, and the total number of hypercube edges crossing the cut corresponds to the total influence of the function. Also note that for any cut (A, V\A), under the chosen edge distribution, the total influence is at most 2n * p(A, V\A).
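The cut/influence correspondence can be made concrete with a toy computation (a sketch with names of my own choosing; a cut is represented by the 0/1 indicator function of A):

```python
from itertools import product

n = 4  # toy dimension

def influence(f, i):
    """Influence of variable i: the probability, over a uniform x in
    {0,1}^n, that flipping coordinate i changes f(x)."""
    flips = 0
    for x in product((0, 1), repeat=n):
        y = x[:i] + (1 - x[i],) + x[i + 1:]  # x with coordinate i flipped
        if f(x) != f(y):
            flips += 1
    return flips / 2 ** n

def total_influence(f):
    return sum(influence(f, i) for i in range(n))

# The "dictator" cut A = {x : x_0 = 1} is crossed only by the 2^{n-1}
# edges of dimension 0, so variable 0 has influence 1 and the rest 0.
dictator = lambda x: x[0]
# Parity is crossed by every hypercube edge, so every influence is 1
# and the total influence is n.
parity = lambda x: sum(x) % 2
```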

A lemma by [KKL88] says that if the total influence of a balanced function f is at most t, then there exists a variable j with influence at least 2^{-ct} for some constant c. If one assumes for contradiction that

p(A, V\A) <= O((log n)/n) * q(A, V\A)

then this lemma implies that there exists a variable with influence at least 1/n^{1/4}.

It was then shown that if one variable j has high influence, then so do many others, namely n^{1/8} variables. This leads to a contradiction, as the total influence then becomes Omega(n^{-1/8}), which is more than O((log n)/n). The proof of this works as follows. If the variable j has high influence, then with probability at least 1/n^{1/4}, x and x + e_j are on different sides of the cut. With probability at most O((log n)/n), x and S(x) are on different sides, and with probability at most O((log n)/n), x + e_j and S(x + e_j) (which is also S(x) + e_{j+1}) are on different sides of the cut. Since S(x) is uniformly distributed whenever x is, this implies that the influence of the variable j+1 is also high. A similar argument extends to the variables j+2, ..., j + n^{1/8}, and therefore the total influence is high, contradicting the earlier assumption.
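The identity S(x + e_j) = S(x) + e_{j+1} used above depends on the shift convention; a quick exhaustive check (my own illustration, taking S to be a cyclic shift one position to the right, with indices mod n) confirms it:

```python
from itertools import product

n = 6  # small enough to check exhaustively

def shift_right(x):
    """S(x): cyclic shift one position to the right."""
    return x[-1:] + x[:-1]

def flip(x, j):
    """x + e_j: flip coordinate j."""
    return x[:j] + (1 - x[j],) + x[j + 1:]

# Flipping coordinate j and then shifting equals shifting and then
# flipping coordinate j+1 (mod n), since the shift moves the entry
# at position j to position j+1.
for x in product((0, 1), repeat=n):
    for j in range(n):
        assert shift_right(flip(x, j)) == flip(shift_right(x), (j + 1) % n)
```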