In each set, how do the ones on the left differ from the ones on the right?

Simply Scheme: Introducing Computer Science ch 16: Example: Pattern Matcher

Simply Scheme: Introducing Computer Science 2/e Copyright (C) 1999 MIT

# Example: Pattern Matcher

 Brian HarveyUniversity of California, Berkeley Matthew WrightUniversity of California, Santa Barbara Download PDF version Back to Table of Contents BACK chapter thread NEXT MIT Press web page for Simply Scheme

It's time for another extended example in which we use the Scheme tools we've been learning to accomplish something practical. We'll start by describing how the program will work before we talk about how to implement it.

You can load our program into Scheme by typing

```(load "match.scm")
```

## Problem Description

A pattern matcher is a commonly used procedure whose job is to compare a sentence to a range of possibilities. An example may make this clear:

```> (match '(* me *) '(love me do))
#T

> (match '(* me *) '(please please me))
#T

> (match '(* me *) '(in my life))
#F
```

The first argument, `(* me *)`, is a pattern. In the pattern, each asterisk (`*`) means "any number of words, including no words at all." So the entire pattern matches any sentence that contains the word "me" anywhere within it. You can think of `match` as a more general form of `equal?` in the sense that it compares two sentences and tells us whether they're the same, but with a broader meaning of "the same."

Our pattern matcher will accept patterns more complicated than this first example. There are four special characters that indicate unspecified parts of a pattern, depending on the number of words that should be allowed:

`?` At most one word. Exactly one word. At least one word. Any number of words.

These characters are meant to be somewhat mnemonic. The question mark means "maybe there's a word." The exclamation point means "precisely one word!" (And it's vertical, just like the digit 1, sort of.) The ampersand, which ordinarily means "and," indicates that we're matching a word and maybe more. The asterisk doesn't have any mnemonic value, but it's what everyone uses for a general matching indicator anyway.

We can give a name to the collection of words that match an unspecified part of a pattern by including in the pattern a word that starts with one of the four special characters and continues with the name. If the match succeeds, `match` will return a sentence containing these names and the corresponding values from the sentence:

```> (match '(*start me *end) '(love me do))
(START LOVE ! END DO !)

> (match '(*start me *end) '(please please me))
(START PLEASE PLEASE ! END !)

> (match '(mean mr mustard) '(mean mr mustard))
()

> (match '(*start me *end) '(in my life))
FAILED
```

In these examples, you see that `match` doesn't really return `#t` or `#f`; the earlier set of examples showed a simplified picture. In the first of the new examples, the special pattern word `*start` is allowed to match any number of words, as indicated by the asterisk. In this case it turned out to match the single word "love." `Match` returns a result that tells us which words of the sentence match the named special words in the pattern. (We'll call one of these special pattern words a placeholder.) The exclamation points in the returned value are needed to separate one match from another. (In the second example, the name `end` was matched by an empty set of words.) In the third example, the match was successful, but since there were no placeholders the returned sentence was empty. If the match is unsuccessful, `match` returns the word `failed`.[1]

If the same placeholder name appears more than once in the pattern, then it must be matched by the same word(s) in the sentence each time:

```> (match '(!twice !other !twice) '(cry baby cry))
(TWICE CRY ! OTHER BABY !)

> (match '(!twice !other !twice) '(please please me))
FAILED
```

Some patterns might be matchable in more than one way. For example, the invocation

```> (match '(*front *back) '(your mother should know))
```

might return any of five different correct answers:

```(FRONT YOUR MOTHER SHOULD KNOW ! BACK !)
(FRONT YOUR MOTHER SHOULD ! BACK KNOW !)
(FRONT YOUR MOTHER ! BACK SHOULD KNOW !)
(FRONT YOUR ! BACK MOTHER SHOULD KNOW !)
(FRONT ! BACK YOUR MOTHER SHOULD KNOW !)
```

We arbitrarily decide that in such cases the first placeholder should match as many words as possible, so in this case `match` will actually return the first of these answers.

Before continuing, you might want to look at the first batch of exercises at the end of this chapter, which are about using the pattern matcher. (The rest of the exercises are about the implementation, which we'll discuss next.)

## Implementation: When Are Two Sentences Equal?

Our approach to implementation will be to start with something we already know how to write: a predicate that tests whether two sentences are exactly equal. We will add capabilities one at a time until we reach our goal.

Suppose that Scheme's primitive `equal?` function worked only for words and not for sentences. We could write an equality tester for sentences, like this:

```(define (sent-equal? sent1 sent2)
(cond ((empty? sent1)
(empty? sent2))
((empty? sent2) #f)
((equal? (first sent1) (first sent2))
(sent-equal? (bf sent1) (bf sent2)))
(else #f)))
```

Two sentences are equal if each word in the first sentence is equal to the corresponding word in the second. They're unequal if one sentence runs out of words before the other.

Why are we choosing to accept Scheme's primitive word comparison but rewrite the sentence comparison? In our pattern matcher, a placeholder in the pattern corresponds to a group of words in the sentence. There is no kind of placeholder that matches only part of a word. (It would be possible to implement such placeholders, but we've chosen not to.) Therefore, we will never need to ask whether a word is "almost equal" to another word.

## When Are Two Sentences Nearly Equal?

Pattern matching is just a more general form of this `sent-equal?` procedure. Let's write a very simple pattern matcher that knows only about the "!" special character and doesn't let us name the words that match the exclamation points in the pattern. We'll call this one `match?` with a question mark because it returns just true or false.

```(define (match? pattern sent)                ;; first version: ! only
(cond ((empty? pattern)
(empty? sent))
((empty? sent) #f)
((equal? (first pattern) '!)
(match? (bf pattern) (bf sent)))
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))
```

This program is exactly the same as `sent-equal?`, except for the highlighted `cond` clause. We are still comparing each word of the pattern with the corresponding word of the sentence, but now an exclamation mark in the pattern matches any word in the sentence. (If `first` of `pattern` is an exclamation mark, we don't even look at `first` of `sent`.)

Our strategy in the next several sections will be to expand the pattern matcher by implementing the remaining special characters, then finally adding the ability to name the placeholders. For now, when we say something like "the `*` placeholder," we mean the placeholder consisting of the asterisk alone. Later, after we add named placeholders, the same procedures will implement any placeholder that begins with an asterisk.

## Matching with Alternatives

The `!` matching is not much harder than `sent-equal?`, because it's still the case that one word of the pattern must match one word of the sentence. When we introduce the `?` option, the structure of the program must be more complicated, because a question mark in the pattern might or might not be paired up with a word in the sentence. In other words, the pattern and the sentence might match without being the same length.

```(define (match? pattern sent)                ;; second version: ! and ?
(cond ((empty? pattern)
(empty? sent))
((equal? (first pattern) '?)
(if (empty? sent)
(match? (bf pattern) '())
(or (match? (bf pattern) (bf sent))
(match? (bf pattern) sent))))
((empty? sent) #f)
((equal? (first pattern) '!)
(match? (bf pattern) (bf sent)))
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))
```

Note that the new `cond` clause comes before the check to see if `sent` is empty. That's because `sent` might be empty and a pattern of `(?)` would still match it. But if the sentence is empty, we know that the question mark doesn't match a word, so we just have to make sure that the `butfirst` of the pattern contains nothing but question marks. (We don't have a predicate named `all-question-marks?`; instead, we use `match?` recursively to make this test.)

In general, a question mark in the pattern has to match either one word or zero words in the sentence. How do we decide? Our rule is that each placeholder should match as many words as possible, so we prefer to match one word if we can. But allowing the question mark to match a word might prevent the rest of the pattern from matching the rest of the sentence.

Compare these two examples:

```> (match? '(? please me) '(please please me))
#T

> (match? '(? please me) '(please me))
#T
```

In the first case, the first thing in the pattern is a question mark and the first thing in the sentence is "please," and they match. That leaves "please me" in the pattern to match "please me" in the sentence.

In the second case, we again have a question mark as the first thing in the pattern and "please" as the first thing in the sentence. But this time, we had better not use up the "please" in the sentence, because that will only leave "me" to match "please me." In this case the question mark has to match no words.

To you, these examples probably look obvious. That's because you're a human being, and you can take in the entire pattern and the entire sentence all at once. Scheme isn't as smart as you are; it has to compare words one pair at a time. To Scheme, the processing of both examples begins with question mark as the first word of the pattern and "please" as the first word of the sentence. The pattern matcher has to consider both cases.

How does the procedure consider both cases? Look at the invocation of `or` by the `match?` procedure. There are two alternatives; if either turns out true, the match succeeds. One is that we try to match the question mark with the first word of the sentence just as we matched `!` in our earlier example—by making a recursive call on the `butfirst`s of the pattern and sentence. If that returns true, then the question mark matches the first word.

The second alternative that can make the match succeed is a recursive call to `match?` on the `butfirst` of the pattern and the entire sentence; this corresponds to matching the `?` against nothing.[2]

Let's trace `match?` so that you can see how these two cases are handled differently by the program.

```> (trace match?)
> (match? '(? please me) '(please please me))
```(match? (? please me) (please please me))
| (match? (please me) (please me))                Try matching ? with please.
|  |  (match? (me) (me))
|  |  |  (match? () ())
|  |  |  #t                                       It works!
|  |  #t
|  #t
#t
```#T
```

```> (match? '(? please me) '(please me))
```(match? (? please me) (please me))
| (match? (please me) (me))                        Try matching ? with please.
|  #f                                              It doesn't work.
| (match? (please me) (please me))                 This time, match ? with nothing.
|  |  (match? (me) (me))
|  |  |  (match? () ())
|  |  |  #t
|  |  #t
|  #t
#t
```#T
```

## Backtracking

The program structure that allows for two alternative routes to success has more profound implications than you may think at first.

When `match?` sees a question mark in the pattern, it has to decide whether or not to "use up" a word of the sentence by matching it with the question mark. You might wonder, "How does the question mark decide whether to take a word?" The answer is that the decision isn't made "by the question mark"; there's nothing about the particular word that the question mark might match that helps with the decision! Instead, the decision depends on matching what comes to the right of the question mark.

Compare this situation with the `keep` recursive pattern. There, too, the procedure makes a decision about the first word of a sentence, and each alternative leads to a recursive call for the `butfirst`:

```(cond ((empty? sent) '())
((some-test? (first sent))
(se (first sent) (recursive-call (bf sent))))
(else (recursive-call (bf sent))))
```

The difference is that in the `keep` pattern the choice between alternatives can be made just by looking at the immediate situation—the single word that might or might not be chosen; the decision doesn't depend on anything in the rest of the problem. As a result, the choice has already been made before any recursive call happens. Therefore, only one of the recursive calls is actually made, to make choices about the remaining words in the sentence.

In `match?`, by contrast, any particular invocation can't make its choice until it knows the result of a recursive invocation. The result from the recursive call determines the choice made by the caller.

Here's a model that might help you think about this kind of recursion. `Match?` sees a question mark in the pattern. It makes a tentative decision that this question mark should match the first word of the sentence, and it uses a recursive invocation to see whether that decision allows the rest of the problem to be solved. If so, the tentative choice was correct. If not, `match?` tries an alternative decision that the question mark doesn't match a word. This alternative is still tentative; another recursive call is needed to see if the rest of the pattern can succeed. If not, the overall match fails.

This structure is called backtracking.

What if there are two question marks in the pattern? Then there are four ways to match the overall pattern. Both question marks can match a word, or only the first question mark, or only the second, or neither. A pattern with several placeholders leads to even more alternatives. A pattern with three question marks will have eight alternatives. (All three match words, the first two do but the third doesn't, and so on.) A pattern with 10 question marks will have 1024 alternatives. How can `match?` try all these alternatives? The procedure seems to make only one two-way choice; how can it accomplish a four-way or many-way decision?

The secret is the same as the usual secret of recursion: Most of the work is done in recursive calls. We take a leap of faith that recursive invocations will take care of the decisions concerning question marks later in the pattern. Think about it using the backtracking model. Let's suppose there are 10 question marks in the pattern. When `match?` encounters the leftmost question mark, it makes a tentative decision to match the question mark with a word of the sentence. To test whether this choice can work, `match?` invokes itself recursively on a pattern with nine question marks. By the leap of faith, the recursive invocation will examine 512 ways to match question marks with words—half of the total number. If one of these 512 works, we're finished. If not, the original `match?` invocation changes its tentative choice, deciding instead not to match its question mark (the leftmost one) with a word of the sentence. Another recursive call is made based on that decision, and that recursive call checks out the remaining 512 possibilities.

By the way, the program doesn't always have to try all of the different combinations of question marks matching or not matching words separately. For example, if the problem is

```(match? '(a b ? ? ? ?) '(x y z w p q))
```

then the very first comparison discovers that `a` is different from `x`, so none of the 16 possible arrangements about question marks matching or not matching words will make a difference.

Here are some traced examples involving patterns with two question marks, to show how the result of backtracking depends on the individual problem.

```> (match? '(? ? foo) '(bar foo))
```(match? (? ? foo) (bar foo))
|  (match? (? foo) (foo))
|  |  (match? (foo) ())
|  |  #f
|  |  (match? (foo) (foo))
|  |  |  (match? () ())
|  |  |  #t
|  |  #t
|  #t
#t
```#T
```

In this first example, the first question mark tries to match the word `bar`, but it can't tell whether or not that match will succeed until the recursive call returns. In the recursive call, the second question mark tries to match the word `foo`, and fails. Then the second question mark tries again, this time matching nothing, and succeeds. Therefore, the first question mark can report success; it never has to try a recursive call in which it doesn't match a word.

In our second example, each question mark will have to try both alternatives, matching and then not matching a word, before the overall match succeeds.

```> (match? '(? ? foo bar) '(foo bar))
```(match? (? ? foo bar) (foo bar))
|  (match? (? foo bar) (bar))
|  |  (match? (foo bar) ())
|  |  #f
|  |  (match? (foo bar) (bar))
|  |  #f
|  #f
|  (match? (? foo bar) (foo bar))
|  |  (match? (foo bar) (bar))
|  |  #f
|  |  (match? (foo bar) (foo bar))
|  |  |  (match? (bar) (bar))
|  |  |  |  (match? () ())
|  |  |  |  #t
|  |  |  #t
|  |  #t
|  #t
#t
```#T
```

The first question mark tries to match the word `foo` in the sentence, leaving the pattern `(? foo bar)` to match `(bar)`. The second question mark will try both matching and not matching a word, but neither succeeds. Therefore, the first question mark tries again, this time not matching a word. The second question mark first tries matching `foo`, and when that fails, tries not matching anything. This last attempt is successful.

In the previous example, every question mark's first attempt failed. The following example illustrates the opposite case, in which every question mark's first attempt succeeds.

```> (match? '(? ? baz) '(foo bar baz))
```(match? (? ? baz) (foo bar baz))
|  (match? (? baz) (bar baz))
|  |  (match? (baz) (baz))
|  |  |  (match? () ())
|  |  |  #t
|  |  #t
|  #t
#t
```#t
```

The first question mark matches `foo`; the second matches `bar`.

If the sentence is shorter than the pattern, we may end up trying to match a pattern against an empty sentence. This is much easier than the general problem, because there aren't two alternatives; a question mark has no word in the sentence to match.

```> (match? '(? ? foo) '())
```(match? (? ? foo) ())
|  (match? (? foo) ())
|  |  (match? (foo) ())
|  |  #f
|  #f
#f
```#f
```

Each question mark knows right away that it had better not try to match a word, so we never have to backtrack.

## Matching Several Words

The next placeholder we'll implement is `*`. The order in which we're implementing these placeholders was chosen so that each new version increases the variability in the number of words a placeholder can match. The `!` placeholder was very easy because it always matches exactly one word; it's hardly different at all from a non-placeholder in the pattern. Implementing `?` was more complicated because there were two alternatives to consider. But for `*`, we might match any number of words, up to the entire rest of the sentence.

Our strategy will be a generalization of the `?` strategy: Start with a "greedy" match, and then, if a recursive call tells us that the remaining part of the sentence can't match the rest of the pattern, try a less greedy match.

The difference between `?` and `*` is that `?` allows only two possible match lengths, zero and one. Therefore, these two cases can be checked with two explicit subexpressions of an `or` expression. In the more general case of `*`, any length is possible, so we can't check every possibility separately. Instead, as in any problem of unknown size, we use recursion. First we try the longest possible match; if that fails because the rest of the pattern can't be matched, a recursive call tries the next-longest match. If we get all the way down to an empty match for the `*` and still can't match the rest of the pattern, then we return `#f`.

```(define (match? pattern sent)              ;; third version: !, ?, and *
(cond ((empty? pattern)
(empty? sent))
((equal? (first pattern) '?)
(if (empty? sent)
(match? (bf pattern) '())
(or (match? (bf pattern) (bf sent))
(match? (bf pattern) sent))))
((equal? (first pattern) '*)
(*-longest-match (bf pattern) sent))
((empty? sent) #f)
((equal? (first pattern) '!)
(match? (bf pattern) (bf sent)))
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))

(define (*-longest-match pattern-rest sent)
(*-lm-helper pattern-rest sent '()))

(define (*-lm-helper pattern-rest sent-matched sent-unmatched)
(cond ((match? pattern-rest sent-unmatched) #t)
((empty? sent-matched) #f)
(else (*-lm-helper pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)))))
```

If an asterisk is found in the pattern, `match?` invokes `*-longest-match`, which carries out this backtracking approach.

The real work is done by `*-lm-helper`, which has three arguments. The first argument is the still-to-be-matched part of the pattern, following the `*` placeholder that we're trying to match now. `Sent-matched` is the part of the sentence that we're considering as a candidate to match the `*` placeholder. `Sent-unmatched` is the remainder of the sentence, following the words in `sent-matched`; it must match `pattern-rest`.

Since we're trying to find the longest possible match, `*-longest-match` chooses the entire sentence as the first attempt for `sent-matched`. Since `sent-matched` is using up the entire sentence, the initial value of `sent-unmatched` is empty. The only job of `*-longest-match` is to invoke `*-lm-helper` with these initial arguments. On each recursive invocation, `*-lm-helper` shortens `sent-matched` by one word and accordingly lengthens `sent-unmatched`.

Here's an example in which the `*` placeholder tries to match four words, then three words, and finally succeeds with two words:

```> (trace match? *-longest-match *-lm-helper)

> (match? '(* days night) '(a hard days night))
```(match? (* days night) (a hard days night))
|  (*-longest-match (days night) (a hard days night))
|  |  (*-lm-helper (days night) (a hard days night) ())
|  |  |  (match? (days night) ())
|  |  |  #f
|  |  |  (*-lm-helper (days night) (a hard days) (night))
|  |  |  |  (match? (days night) (night))
|  |  |  |  #f
|  |  |  |  (*-lm-helper (days night) (a hard) (days night))
|  |  |  |  |  (match? (days night) (days night))
|  |  |  |  |  |  (match? (night) (night))
|  |  |  |  |  |  |  (match? () ())
|  |  |  |  |  |  |  #t
|  |  |  |  |  |  #t
|  |  |  |  |  #t
|  |  |  |  #t
|  |  |  #t
|  |  #t
|  #t
#t
```#t
```

## Combining the Placeholders

We have one remaining placeholder, `&`, which is much like `*` except that it fails unless it can match at least one word. We could, therefore, write a `&-longest-match` that would be identical to `*-longest-match` except for the base case of its helper procedure. If `sent-matched` is empty, the result is `#f` even if it would be possible to match the rest of the pattern against the rest of the sentence. (All we have to do is exchange the first two clauses of the `cond`.)

```(define (&-longest-match pattern-rest sent)
(&-lm-helper pattern-rest sent '()))

(define (&-lm-helper pattern-rest sent-matched sent-unmatched)
(cond ((empty? sent-matched) #f)
((match? pattern-rest sent-unmatched) #t)
(else (&-lm-helper pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)))))
```

When two procedures are so similar, that's a clue that perhaps they could be combined into one. We could look at the bodies of these two procedures to find a way to combine them textually. But instead, let's step back and think about the meanings of the placeholders.

The reason that the procedures `*-longest-match` and `&-longest-match` are so similar is that the two placeholders have almost identical meanings. `*` means "match as many words as possible"; `&` means "match as many words as possible, but at least one." Once we're thinking in these terms, it's plausible to think of `?` as meaning "match as many words as possible, but at most one." In fact, although this is a stretch, we can also describe `!` similarly: "Match as many words as possible, but at least one, and at most one."

Placeholder  Minimum size   Maximum size
`*`0no limit
`&`1no limit
`?`01
`!`11

We'll take advantage of this newly understood similarity to simplify the program by using a single algorithm for all placeholders.

How do we generalize `*-longest-match` and `&-longest-match` to handle all four cases? There are two kinds of generalization involved. We'll write a procedure `longest-match` that will have the same arguments as `*-longest-match`, plus two others, one for for the minimum size of the matched text and one for the maximum.

We'll specify the minimum size with a formal parameter `min`. (The corresponding argument will always be 0 or 1.) `Longest-match` will pass the value of `min` down to `lm-helper`, which will use it to reject potential matches that are too short.

Unfortunately, we can't use a number to specify the maximum size, because for `*` and `&` there is no maximum. Instead, `longest-match` has a formal parameter `max-one?` whose value is `#t` only for `?` and `!`.

Our earlier, special-case versions of `longest-match` were written for `*` and `&`, the placeholders for which `max-one?` will be false. For those placeholders, the new `longest-match` will be just like the earlier versions. Our next task is to generalize `longest-match` so that it can handle the `#t` cases.

Think about the meaning of the `sent-matched` and `sent-unmatched` parameters in the `lm-helper` procedures. `Sent-matched` means "the longest part of the sentence that this placeholder is still allowed to match," while `sent-unmatched` contains whatever portion of the sentence has already been disqualified from being matched by the placeholder.

Consider the behavior of `*-longest-match` when an asterisk is at the beginning of a pattern that we're trying to match against a seven-word sentence. Initially, `sent-matched` is the entire seven-word sentence, and `sent-unmatched` is empty. Then, supposing that doesn't work, `sent-matched` is a six-word sentence, while `sent-unmatched` contains the remaining word. This continues as long as no match succeeds until, near the end of `longest-match`'s job, `sent-matched` is a one-word sentence and `sent-unmatched` contains six words. At this point, the longest possible match for the asterisk is a single word.

This situation is where we want to start in the case of the `?` and `!` placeholders. So when we're trying to match one of these placeholders, our initialization procedure won't use the entire sentence as the initial value of `sent-matched`; rather, the initial value of `sent-matched` will be a one-word sentence, and `sent-unmatched` will contain the rest of the sentence.

```(define (longest-match pattern-rest sent min max-one?)  ;; first version
(cond ((empty? sent)
(and (= min 0) (match? pattern-rest sent)))
(max-one?
(lm-helper pattern-rest (se (first sent)) (bf sent) min))
(else (lm-helper pattern-rest sent '() min))))

(define (lm-helper pattern-rest sent-matched sent-unmatched min)
(cond ((< (length sent-matched) min) #f)
((match? pattern-rest sent-unmatched) #t)
((empty? sent-matched) #f)
(else (lm-helper pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)
min))))
```

Now we can rewrite `match?` to use `longest-match`. `Match?` will delegate the handling of all placeholders to a subprocedure `match-special` that will invoke `longest-match` with the correct values for `min` and `max-one?` according to the table.

```(define (match? pattern sent)                          ;; fourth version
(cond ((empty? pattern)
(empty? sent))
((special? (first pattern))
(match-special (first pattern) (bf pattern) sent))
((empty? sent) #f)
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))

(define (special? wd)                                  ;; first version
(member? wd '(* & ? !)))

(define (match-special placeholder pattern-rest sent)  ;; first version
(cond ((equal? placeholder '?)
(longest-match pattern-rest sent 0 #t))
((equal? placeholder '!)
(longest-match pattern-rest sent 1 #t))
((equal? placeholder '*)
(longest-match pattern-rest sent 0 #f))
((equal? placeholder '&)
(longest-match pattern-rest sent 1 #f))))
```

## Naming the Matched Text

So far we've worked out how to match the four kinds of placeholders and return a true or false value indicating whether a match is possible. Our program is almost finished; all we need to make it useful is the facility that will let us find out which words in the sentence matched each placeholder in the pattern.

We don't have to change the overall structure of the program in order to make this work. But most of the procedures in the pattern matcher will have to be given an additional argument, the database of placeholder names and values that have been matched so far.[3] The formal parameter `known-values` will hold this database. Its value will be a sentence containing placeholder names followed by the corresponding words and an exclamation point to separate the entries, as in the examples earlier in the chapter. When we begin the search for a match, we use an empty sentence as the initial `known-values`:

```(define (match pattern sent)
(match-using-known-values pattern sent '()))

(define (match-using-known-values pattern sent known-values)
…)
```

As `match-using-known-values` matches the beginning of a pattern with the beginning of a sentence, it invokes itself recursively with an expanded `known-values` containing each newly matched placeholder. For example, in evaluating

```(match '(!twice !other !twice) '(cry baby cry))
```

the program will call `match-using-known-values` four times:

`pattern``sent` `known-values`
`(!twice !other !twice)  ` `(cry baby cry)  ` `()`
`(!other !twice)` `(baby cry)` `(twice cry !)`
`(!twice)` `(cry)` `(twice cry ! other baby !)`
`()` `()` `(twice cry ! other baby !)`

In the first invocation, we try to match `!twice` against some part of the sentence. Since `!` matches exactly one word, the only possibility is to match the word `cry`. The recursive invocation, therefore, is made with the first words of the pattern and sentence removed, but with the match between `twice` and `cry` added to the database.

Similarly, the second invocation matches `!other` with `baby` and causes a third invocation with shortened pattern and sentence but a longer database.

The third invocation is a little different because the pattern contains the placeholder `!twice`, but the name `twice` is already in the database. Therefore, this placeholder can't match whatever word happens to be available; it must match the same word that it matched before. (Our program will have to check for this situation.) Luckily, the sentence does indeed contain the word `cry` at this position.

The final invocation reaches the base case of the recursion, because the pattern is empty. The value that `match-using-known-values` returns is the database in this invocation.

## The Final Version

We're now ready to show you the final version of the program. The program structure is much like what you've seen before; the main difference is the database of placeholder names and values. The program must add entries to this database and must look for database entries that were added earlier. Here are the three most important procedures and how they are changed from the earlier version to implement this capability:

• `match-using-known-values`, essentially the same as what was formerly named `match?` except for bookkeeping details.
• `match-special`, similar to the old version, except that it must recognize the case of a placeholder whose name has already been seen. In this case, the placeholder can match only the same words that it matched before.
• `longest-match` and `lm-helper`, also similar to the old versions, except that they have the additional job of adding to the database the name and value of any placeholder that they match.

Here are the modified procedures. Compare them to the previous versions.

```(define (match pattern sent)
(match-using-known-values pattern sent '()))

(define (match-using-known-values pattern sent known-values)
(cond ((empty? pattern)
(if (empty? sent) known-values 'failed))
((special? (first pattern))
(let ((placeholder (first pattern)))
(match-special (first placeholder)
(bf placeholder)
(bf pattern)
sent
known-values)))
((empty? sent) 'failed)
((equal? (first pattern) (first sent))
(match-using-known-values (bf pattern) (bf sent) known-values))
(else 'failed)))

(define (match-special howmany name pattern-rest sent known-values)
(let ((old-value (lookup name known-values)))
(cond ((not (equal? old-value 'no-value))
(if (length-ok? old-value howmany)
(already-known-match
old-value pattern-rest sent known-values)
'failed))
((equal? howmany '?)
(longest-match name pattern-rest sent 0 #t known-values))
((equal? howmany '!)
(longest-match name pattern-rest sent 1 #t known-values))
((equal? howmany '*)
(longest-match name pattern-rest sent 0 #f known-values))
((equal? howmany '&)
(longest-match name pattern-rest sent 1 #f known-values)))))

(define (longest-match name pattern-rest sent min max-one? known-values)
(cond ((empty? sent)
(if (= min 0)
(match-using-known-values pattern-rest
sent
(add name '() known-values))
'failed))
(max-one?
(lm-helper name pattern-rest (se (first sent))
(bf sent) min known-values))
(else (lm-helper name pattern-rest
sent '() min known-values))))

(define (lm-helper name pattern-rest
sent-matched sent-unmatched min known-values)
(if (< (length sent-matched) min)
'failed
(let ((tentative-result (match-using-known-values
pattern-rest
sent-unmatched
(add name sent-matched known-values))))
(cond ((not (equal? tentative-result 'failed)) tentative-result)
((empty? sent-matched) 'failed)
(else (lm-helper name
pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)
min
known-values))))))
```

We haven't listed all of the minor procedures that these procedures invoke. A complete listing is at the end of the chapter, but we hope that you have enough confidence about the overall program structure to be able to assume these small details will work. In the next few paragraphs we discuss some of the ways in which the procedures shown here differ from the earlier versions.

In the invocation of `match-special` we found it convenient to split the placeholder into its first character, the one that tells how many words can be matched, and the butfirst, which is the name of the placeholder.

What happens if `match-special` finds that the name is already in the database? In this situation, we don't have to try multiple possibilities for the number of words to match (the usual job of `longest-match`); the placeholder must match exactly the words that it matched before. In this situation, three things must be true in order for the match to succeed: (1) The first words of the `sent` argument must match the old value stored in the database. (2) The partial `pattern` that remains after this placeholder must match the rest of the `sent`. (3) The old value must be consistent with the number of words permitted by the `howmany` part of the placeholder. For example, if the pattern is

```(*stuff and !stuff)
```

and the database says that the placeholder `*stuff` was matched by three words from the sentence, then the second placeholder `!stuff` can't possibly be matched because it accepts only one word. This third condition is actually checked first, by `length-ok?`, and if we pass that hurdle, the other two conditions are checked by `already-known-match`.

The only significant change to `longest-match` is that it invokes `add` to compute an expanded database with the newly found match added, and it uses the resulting database as an argument to `match-using-known-values`.

## Abstract Data Types

As you know, a database of known values is represented in this program as a sentence in which the entries are separated by exclamation points. Where is this representation accomplished in the program you've seen? There's nothing like

`…(sentence old-known-values name value '!) …`

anywhere in the procedures we've shown. Instead, the program makes reference to the database of known values through two procedure calls:

```(lookup name known-values)                   ; in match-special
(add name matched known-values)              ; in longest-match
```

Only the procedures `lookup` and `add` manipulate the database of known values:

```(define (lookup name known-values)
(cond ((empty? known-values) 'no-value)
((equal? (first known-values) name)
(get-value (bf known-values)))
(else (lookup name (skip-value known-values)))))

(define (get-value stuff)
(if (equal? (first stuff) '!)
'()
(se (first stuff) (get-value (bf stuff)))))

(define (skip-value stuff)
(if (equal? (first stuff) '!)
(bf stuff)
(skip-value (bf stuff))))

(define (add name value known-values)
(if (empty? name)
known-values
(se known-values name value '!)))
```

These procedures are full of small details. For example, it's a little tricky to extract the part of a sentence from a name to the next exclamation point. It's convenient that we could write the more important procedures, such as `longest-match`, without filling them with these details. As far as `longest-match` knows, `lookup` and `add` could be Scheme primitive procedures. In effect we've created a new data type, with `add` as its constructor and `lookup` as its selector.

Types such as these, that are invented by a programmer and aren't part of the Scheme language itself, are called abstract data types. Creating an abstract data type means drawing a barrier between an idea about some kind of information we want to model in a program and the particular mechanism that we use to represent the information. In this case, the information is a collection of name-value associations, and the particular mechanism is a sentence with exclamation points and so on. The pattern matcher doesn't think of the database as a sentence. For example, it would be silly to translate the database into Pig Latin or find its acronym.

Just as we distinguish the primitive procedures that Scheme knows all along from the compound procedures that the Scheme programmer defines, we could use the names "primitive data type" for types such as numbers and Booleans that are built into Scheme and "compound data type" for ones that the programmer invents by defining selectors and constructors. But "compound data type" is a bit of a pun, because it also suggests a data type built out of smaller pieces, just as a compound expression is built of smaller expressions. Perhaps that's why the name "abstract data type" has become generally accepted. It's connected to the idea of abstraction that we introduced earlier, because in order to create an abstract data type, we must specify the selectors and constructors and give names to those patterns of computation.

## Backtracking and `Known-Values`

What happens to the database in cases that require backtracking, where a particular recursive call might be "on the wrong track"? Let's trace `match-using-known-values` and see what happens. (We'll use the little-people model to discuss this example, and so we're annotating each invocation in the trace with the name of its little person.)

```> (trace match-using-known-values)
> (match '(*start me *end) '(love me do))```
(match-using-known-values (*start me *end) (love me do) ())          Martha
|  (match-using-known-values (me *end) () (start love me do !))      Mercutio
|  failed
|  (match-using-known-values (me *end) (do) (start love me !))       Masayuki
|  failed
|  (match-using-known-values (me *end) (me do) (start love !))       Mohammad
|  |  (match-using-known-values (*end) (do) (start love !))          Mae
|  |  |  (match-using-known-values () () (start love ! end do !))    Merlin
|  |  |  (start love ! end do !)
|  |  (start love ! end do !)
|  (start love ! end do !)
(start love ! end do !)
```(START LOVE ! END DO !)
```

Martha, the first little person shown, has an empty `known-values`. She makes three attempts to match `*start` with parts of the sentence. In each case, a little person is hired with the provisional match in his or her `known-values`. (Actually, Martha does not directly hire Mercutio and the others. Martha hires a `match-special` little person, who in turn hires a `longest-match` specialist, who hires an `lm-helper` specialist, who hires Mercutio. But that added complexity isn't important for the point we're focusing on right now, namely, how backtracking can work. Pretend Martha hires Mercutio.)

If you don't use the little-people model, but instead think about the program as if there were just one `known-values` variable, then the backtracking can indeed be very mysterious. Once a provisional match is added to the database, how is it ever removed? The answer is that it doesn't work that way. There isn't a "the" database. Instead, each little person has a separate database. If an attempted match fails, the little person who reports the failure just stops working. For example, Martha hires Mercutio to attempt a match in which the name `start` has the value ```love me do```. Mercutio is unable to complete the match, and reports failure. It is Martha, not Mercutio, who then hires Masayuki to try another value for `start`. Martha's database hasn't changed, so Martha gives Masayuki a database that reflects the new trial value but not the old one.

Not every hiring of a little person starts from an empty database. When a match is partially successful, the continuation of the same attempt must benefit from the work that's already been done. So, for example, when Mohammad hires Mae, and when Mae hires Merlin, each of them passes on an extended database, not an empty one. Specifically, Mae gives Merlin the new match of the name `end` with the value `do`, but also the match of `start` with `love` that she was given by Mohammad.

So as you can see, we don't have to do anything special to keep track of our database when we backtrack; the structure of the recursion takes care of everything for free.

## How We Wrote It

For explanatory purposes we've chosen to present the pieces of this program in a different order from the one in which we actually wrote them. We did implement the easy placeholders (`!` and `?`) before the harder ones. But our program had provision for a database of names from the beginning.

There is no "right" way to approach a programming problem. Our particular approach was determined partly by our past experience. Each of us had written similar programs before, and we had preconceived ideas about the easy and hard parts. You might well start at a different point. For example, here is an elegant small program we'd both been shown by friends:

```(define (match? pattern sent)
(cond ((empty? pattern) (empty? sent))
((empty? sent)
(and (equal? (first pattern) '*) (match? (bf pattern) sent)))
((equal? (first pattern) '*)
(or (match? pattern (bf sent))
(match? (bf pattern) sent)))
(else (and (equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent))))))
```

What's appealing about this is the funny symmetry of taking the `butfirst` of the pattern or of the sentence. That's not something you'd naturally think of, probably, but once you've worked out how it can work, it affects your preconceptions when you set out to write a pattern matcher yourself.

Based on that inspiration, we might well have started with the hard cases (such as `*`), with the idea that once they're in place, the easy cases won't change the program structure much.

## Complete Program Listing

```(define (match pattern sent)
(match-using-known-values pattern sent '()))

(define (match-using-known-values pattern sent known-values)
(cond ((empty? pattern)
(if (empty? sent) known-values 'failed))
((special? (first pattern))
(let ((placeholder (first pattern)))
(match-special (first placeholder)
(bf placeholder)
(bf pattern)
sent
known-values)))
((empty? sent) 'failed)
((equal? (first pattern) (first sent))
(match-using-known-values (bf pattern) (bf sent) known-values))
(else 'failed)))

(define (special? wd)
(member? (first wd) '(* & ? !)))

(define (match-special howmany name pattern-rest sent known-values)
(let ((old-value (lookup name known-values)))
(cond ((not (equal? old-value 'no-value))
(if (length-ok? old-value howmany)
(already-known-match
old-value pattern-rest sent known-values)
'failed))
((equal? howmany '?)
(longest-match name pattern-rest sent 0 #t known-values))
((equal? howmany '!)
(longest-match name pattern-rest sent 1 #t known-values))
((equal? howmany '*)
(longest-match name pattern-rest sent 0 #f known-values))
((equal? howmany '&)
(longest-match name pattern-rest sent 1 #f known-values)))))

(define (length-ok? value howmany)
(cond ((empty? value) (member? howmany '(? *)))
((not (empty? (bf value))) (member? howmany '(* &)))
(else #t)))

(define (already-known-match value pattern-rest sent known-values)
(let ((unmatched (chop-leading-substring value sent)))
(if (not (equal? unmatched 'failed))
(match-using-known-values pattern-rest unmatched known-values)
'failed)))

(define (chop-leading-substring value sent)
(cond ((empty? value) sent)
((empty? sent) 'failed)
((equal? (first value) (first sent))
(chop-leading-substring (bf value) (bf sent)))
(else 'failed)))

(define (longest-match name pattern-rest sent min max-one? known-values)
(cond ((empty? sent)
(if (= min 0)
(match-using-known-values pattern-rest
sent
(add name '() known-values))
'failed))
(max-one?
(lm-helper name pattern-rest (se (first sent))
(bf sent) min known-values))
(else (lm-helper name pattern-rest
sent '() min known-values))))

(define (lm-helper name pattern-rest
sent-matched sent-unmatched min known-values)
(if (< (length sent-matched) min)
'failed
(let ((tentative-result (match-using-known-values
pattern-rest
sent-unmatched
(add name sent-matched known-values))))
(cond ((not (equal? tentative-result 'failed)) tentative-result)
((empty? sent-matched) 'failed)
(else (lm-helper name
pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)
min
known-values))))))

;;; Known values database abstract data type

(define (lookup name known-values)
(cond ((empty? known-values) 'no-value)
((equal? (first known-values) name)
(get-value (bf known-values)))
(else (lookup name (skip-value known-values)))))

(define (get-value stuff)
(if (equal? (first stuff) '!)
'()
(se (first stuff) (get-value (bf stuff)))))

(define (skip-value stuff)
(if (equal? (first stuff) '!)
(bf stuff)
(skip-value (bf stuff))))

(define (add name value known-values)
(if (empty? name)
known-values
(se known-values name value '!)))
```

## Exercises about Using the Pattern Matcher

16.1  Design and test a pattern that matches any sentence containing the word `C` three times (not necessarily next to each other).

16.2  Design and test a pattern that matches a sentence consisting of two copies of a smaller sentence, such as `(a b a b)`.

16.3  Design and test a pattern that matches any sentence of no more than three words.

16.4  Design and test a pattern that matches any sentence of at least three words.

16.5  Show sentences of length 2, 3, and 4 that match the pattern

```(*x *y *y *x)
```

For each length, if no sentence can match the pattern, explain why not.

16.6  Show sentences of length 2, 3, and 4 that match the pattern

```(*x *y &y &x)
```

For each length, if no sentence can match the pattern, explain why not.

16.7  List all the sentences of length 6 or less, starting with `a b a`, that match the pattern

```(*x *y *y *x)
```

## Exercises about Implementation

16.8  Explain how `longest-match` handles an empty sentence.

16.9  Suppose the first `cond` clause in `match-using-known-values` were

```((empty? pattern) known-values)
```

Give an example of a pattern and sentence for which the modified program would give a different result from the original.

16.10  What happens if the sentence argument—not the pattern—contains the word `*` somewhere?

16.11  For each of the following examples, how many `match-using-known-values` little people are required?

```(match '(from me to you) '(from me to you))
(match '(*x *y *x) '(a b c a b))
(match '(*x *y *z) '(a b c a b))
(match '(*x hey *y bulldog *z) '(a hey b bulldog c))
(match '(*x a b c d e f) '(a b c d e f))
(match '(a b c d e f *x) '(a b c d e f))
```

In general, what can you say about the characteristics that make a pattern easy or hard to match?

16.12  Show a pattern with the following two properties: (1) It has at least two placeholders. (2) When you match it against any sentence, every invocation of `lookup` returns `no-value`.

16.13  Show a pattern and a sentence that can be used as arguments to `match` so that `lookup` returns `(the beatles)` at some point during the match.

16.14  Our program can still match patterns with unnamed placeholders. How would it affect the operation of the program if these unnamed placeholders were added to the database? What part of the program keeps them from being added?

16.15  Why don't `get-value` and `skip-value` check for an empty argument as the base case?

16.16  Why didn't we write the first `cond` clause in `length-ok?` as the following?

```((and (empty? value) (member? howmany '(? *))) #t)
```

16.17  Where in the program is the initial empty database of known values established?

16.18  For the case of matching a placeholder name that's already been matched in this pattern, we said on page there that three conditions must be checked. For each of the three, give a pattern and sentence that the program would incorrectly match if the condition were not checked.

16.19  What will the following example do?

```(match '(?x is *y !x) '(! is an exclamation point !))
```

Can you suggest a way to fix this problem?

16.20  Modify the pattern matcher so that a placeholder of the form `*15x` is like `*x` except that it can be matched only by exactly 15 words.

```> (match '(*3front *back) '(your mother should know))
(FRONT YOUR MOTHER SHOULD ! BACK KNOW !)
```

16.21  Modify the pattern matcher so that a `+` placeholder (with or without a name attached) matches only a number:

```> (match '(*front +middle *back) '(four score and 7 years ago))
(FRONT FOUR SCORE AND ! MIDDLE 7 ! BACK YEARS AGO !)
```

The `+` placeholder is otherwise like `!`—it must match exactly one word.

16.22  Does your favorite text editor or word processor have a search command that allows you to search for patterns rather than only specific strings of characters? Look into this and compare your editor's capabilities with that of our pattern matcher.

[1] Why not return the sentence if successful or `#f` otherwise? That would be fine in most versions of Scheme, but as we mentioned earlier, the empty sentence `()` is the same as the false value `#f` in some dialects. In those Schemes, a successfully matched pattern with no named placeholders, for which the program should return an empty sentence, would be indistinguishable from an unmatched pattern.

[2] Actually, since `or` is a special form, Scheme avoids the need to try the second alternative if the first one succeeds.

[3] The word database has two possible meanings in computer science, a broad meaning and a narrow one. The broad meaning, which we're using here, is a repository of information to which the program periodically adds new items for later retrieval. The narrow meaning is a collection of information that's manipulated by a database program, which provides facilities for adding new information, modifying existing entries, selecting entries that match some specified criterion, and so on. We'll see a database program near the end of the book.

(back to Table of Contents)

BACK chapter thread NEXT

Brian Harvey, `bh@cs.berkeley.edu`