CHAPTER SIXTEEN: PATTERN MATCHING (Part 7)

The Art of ASSEMBLY LANGUAGE PROGRAMMING

Chapter Sixteen (Part 6)	Table of Content	Chapter Sixteen (Part 8)

CHAPTER SIXTEEN:
PATTERN MATCHING (Part 7)

16.6 - Semantic Rules and Actions
16.7 - Constructing Patterns for the MATCH Routine

16.6 Semantic Rules and Actions

Automata theory is mainly concerned with whether or not a string matches a given pattern. Like many theoretical sciences, practitioners of automata theory are only concerned if something is possible, the practical applications are not as important. For real programs, however, we would like to perform certain operations if we match a string or perform one from a set of operations depending on how we match the string.

A semantic rule or semantic action is an operation you perform based upon the type of pattern you match. This is, it is the piece of code you execute when you are satisfied with some pattern matching behavior. For example, the call to patgrab in the previous section is an example of a semantic action.

Normally, you execute the code associated with a semantic rule after returning from the call to match. Certainly when processing regular expressions, there is no need to process a semantic action in the middle of pattern matching operation. However, this isn't the case for a context free grammar. Context free grammars often involve recursion or may use the same pattern several times when matching a single string (that is, you may reference the same nonterminal several times while matching the pattern). The pattern matching data structure only maintains pointers (EndPattern, StartPattern, and StrSeg) to the last substring matched by a given pattern. Therefore, if you reuse a subpattern while matching a string and you need to execute a semantic rule associated with that subpattern, you will need to execute that semantic rule in the middle of the pattern matching operation, before you reference that subpattern again.

It turns out to be very easy to insert semantic rules in the middle of a pattern matching operation. All you need to do is write a pattern matching function that always succeeds (i.e., it returns with the carry flag clear). Within the body of your pattern matching routine you can choose to ignore the string the matching code is testing and perform any other actions you desire.

Your semantic action routine, on return, must set the carry flag and it must copy the original contents of di into ax. It must preserve all other registers. Your semantic action must not call the match routine (call sl_match2 instead). Match does not allow recursion (it is not reentrant) and calling match within a semantic action routine will mess up the pattern match in progress.

The following example provides several examples of semantic action routines within a program. This program converts arithmetic expressions in infix (algebraic) form to reverse polish notation (RPN) form.

; INFIX.ASM
;
; A simple program which demonstrates the pattern matching routines in the
; UCR library. This program accepts an arithmetic expression on the command
; line (no interleaving spaces in the expression is allowed, that is, there
; must be only one command line parameter) and converts it from infix notation
; to postfix (rpn) notation.

                .xlist
                include         stdlib.a
                includelib      stdlib.lib
                matchfuncs
                .list


dseg            segment para public 'data'

; Grammar for simple infix -> postfix translation operation
; (the semantic actions are enclosed in braces}:
;
; E -> FE'
; E' -> +F {output '+'} E' | -F {output '-'} E' | <empty string>
; F -> TF'
; F -> *T {output '*'} F' | /T {output '/'} F' | <empty string>
; T -> -T {output 'neg'} | S
; S -> <constant> {output constant} | (E)
;
; UCR Standard Library Pattern which handles the grammar above:

; An expression consists of an "E" item followed by the end of the string:

infix2rpn               pattern {sl_Match2,E,,EndOfString}
EndOfString             pattern {EOS}

; An "E" item consists of an "F" item optionally followed by "+" or "-"
; and another "E" item:

E               pattern {sl_Match2, F,,Eprime}
Eprime          pattern {MatchChar, '+', Eprime2, epf}
epf             pattern {sl_Match2, F,,epPlus}
epPlus          pattern {OutputPlus,,,Eprime}           ;Semantic rule

Eprime2         pattern {MatchChar, '-', Succeed, emf}
emf             pattern {sl_Match2, F,,epMinus}
epMinus         pattern {OutputMinus,,,Eprime}          ;Semantic rule

; An "F" item consists of a "T" item optionally followed by "*" or "/"
; followed by another "T" item:

F               pattern {sl_Match2, T,,Fprime}
Fprime          pattern {MatchChar, '*', Fprime2, fmf}
fmf             pattern {sl_Match2, T, 0, pMul}
pMul            pattern {OutputMul,,,Fprime}            ;Semantic rule

Fprime2         pattern {MatchChar, '/', Succeed, fdf}
fdf             pattern {sl_Match2, T, 0, pDiv}
pDiv            pattern {OutputDiv, 0, 0,Fprime}        ;Semantic rule

; T item consists of an "S" item or a "-" followed by another "T" item:

T               pattern {MatchChar, '-', S, TT}
TT              pattern {sl_Match2, T, 0,tpn}
tpn             pattern {OutputNeg}                     ;Semantic rule

; An "S" item is either a string of one or more digits or "(" followed by
; and "E" item followed by ")":

Const           pattern {sl_Match2, DoDigits, 0, spd}
spd             pattern {OutputDigits}                  ;Semantic rule

DoDigits        pattern {Anycset, Digits, 0, SpanDigits}
SpanDigits      pattern {Spancset, Digits}

S               pattern {MatchChar, '(', Const, IntE}
IntE            pattern {sl_Match2, E, 0, CloseParen}
CloseParen      pattern {MatchChar, ')'}


Succeed         pattern {DoSucceed}


                include stdsets.a

dseg            ends



cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg

; DoSucceed matches the empty string. In other words, it matches anything
; and always returns success without eating any characters from the input
; string.

DoSucceed       proc    far
                mov     ax, di
                stc
                ret
DoSucceed       endp


; OutputPlus is a semantic rule which outputs the "+" operator after the
; parser sees a valid addition operator in the infix string.

OutputPlus      proc    far
                print
                byte    " +",0
                mov     ax, di                  ;Required by sl_Match
                stc
                ret
OutputPlus      endp


; OutputMinus is a semantic rule which outputs the "-" operator after the
; parser sees a valid subtraction operator in the infix string.

OutputMinus     proc    far
                print
                byte    " -",0
                mov     ax, di                  ;Required by sl_Match
                stc
                ret
OutputMinus     endp


; OutputMul is a semantic rule which outputs the "*" operator after the
; parser sees a valid multiplication operator in the infix string.

OutputMul       proc    far
                print
                byte    " *",0
                mov     ax, di                  ;Required by sl_Match
                stc
                ret
OutputMul       endp


; OutputDiv is a semantic rule which outputs the "/" operator after the
; parser sees a valid division operator in the infix string.

OutputDiv       proc    far
                print
                byte    " /",0
                mov     ax, di                  ;Required by sl_Match
                stc
                ret
OutputDiv       endp


; OutputNeg is a semantic rule which outputs the unary "-" operator after the
; parser sees a valid negation operator in the infix string.

OutputNeg       proc    far
                print
                byte    " neg",0
                mov     ax, di                  ;Required by sl_Match
                stc
                ret
OutputNeg       endp


; OutputDigits outputs the numeric value when it encounters a legal integer
; value in the input string.

OutputDigits    proc    far
                push    es
                push    di
                mov     al, ' '
                putc
                lesi    const
                patgrab
                puts
                free
                stc
                pop     di
                mov     ax, di
                pop     es
                ret
OutputDigits    endp



; Okay, here's the main program which fetches the command line parameter
; and parses it.

Main            proc
                mov     ax, dseg
                mov     ds, ax
                mov     es, ax

                meminit                         ; memory to the heap.


                print
                byte    "Enter an arithmetic expression: ",0
                getsm
                print
                byte    "Expression in postfix form: ",0

                ldxi    infix2rpn
                xor     cx, cx
                match
                jc      Succeeded

                print
                byte    "Syntax error",0

Succeeded:      putcr

Quit:           ExitPgm
Main            endp

cseg            ends



; Allocate a reasonable amount of space for the stack (8k).

sseg            segment para stack 'stack'
stk             db      1024 dup ("stack ")
sseg            ends


; zzzzzzseg must be the last segment that gets loaded into memory!

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       db      16 dup (?)
zzzzzzseg       ends
                end     Main

16.7 Constructing Patterns for the MATCH Routine

A major issue we have yet to discuss is how to convert regular expressions and context free grammars into patterns suitable for the UCR Standard Library pattern matching routines. Most of the examples appearing up to this point have used an ad hoc translation scheme; now it is time to provide an algorithm to accomplish this.

The following algorithm converts a context free grammar to a UCR Standard Library pattern data structure. If you want to convert a regular expression to a pattern, first convert the regular expression to a context free grammar. Of course, it is easy to convert many regular expression forms directly to a pattern, when such conversions are obvious you can bypass the following algorithm; for example, it should be obvious that you can use spancset to match a regular expression like [0-9]*.

The first step you must always take is to eliminate left recursion from the grammar. You will generate an infinite loop (and crash the machine) if you attempt to code a grammar containing left recursion into a pattern data structure. You might also want to left factor the grammar while you are eliminating left recursion. The Standard Library routines fully support backtracking, so left factoring is not strictly necessary, however, the matching routine will execute faster if it does not need to backtrack.

If a grammar production takes the form A B C where A, B, and C are nonterminal symbols, you would create the following pattern:

A		pattern	{sl_match2,B,0,C}

This pattern description for A checks for an occurrence of a B pattern followed by a C pattern.

If B is a relatively simple production (that is, you can convert it to a single pattern data structure), you can optimize this to:

A		pattern	{B's Matching Function, B's parameter, 0, C}

The remaining examples will always call sl_match2, just to be consistent. However, as long as the nonterminals you invoke are simple, you can fold them into A''s pattern.

If a grammar production takes the form A B | C where A, B, and C are nonterminal symbols, you would create the following pattern:

A		pattern	{sl_match2, B, C}

This pattern tries to match B. If it succeeds, A succeeds; if it fails, it tries to match C. At this point, A''s success or failure is the success or failure of C.

Handling terminal symbols is the next thing to consider. These are quite easy - all you need to do is use the appropriate matching function provided by the Standard Library, e.g., matchstr or matchchar. For example, if you have a production of the form A abc | y you would convert this to the following pattern:

A               pattern {matchstr,abc,ypat}

abc             byte    "abc",0

ypat            pattern {matchchar,'y'}

The only remaining detail to consider is the empty string. If you have a production of the form A then you need to write a pattern matching function that always succeed. The elegant way to do this is to write a custom pattern matching function. This function is

succeed         proc    far
                mov     ax, di  ;Required by sl_match
                stc             ;Always succeed.
                ret
succeed         endp

Another, sneaky, way to force success is to use matchstr and pass it the empty string to match, e.g.,

success         pattern {matchstr, emptystr}
emptystr        byte    0

The empty string always matches the input string, no matter what the input string contains.

If you have a production with several alternatives and is one of them, you must process last. For example, if you have the productions A abc | y | BC | you would use the following pattern:

A               pattern {matchstr,abc, tryY}
abc             byte    "abc",0
tryY            pattern {matchchar, 'y', tryBC}
tryBC           pattern {sl_match2, B, DoSuccess, C}
DoSuccess       pattern {succeed}

While the technique described above will let you convert any CFG to a pattern that the Standard Library can process, it certainly does not take advantage of the Standard Library facilities, nor will it produce particularly efficient patterns. For example, consider the production:

Digits   0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Converting this to a pattern using the techniques described above will yield the pattern:

Digits		pattern	{matchchar, '0', try1}
try1		pattern	{matchchar, '1', try2}
try2		pattern	{matchchar, '2', try3}
try3		pattern	{matchchar, '3', try4}
try4		pattern	{matchchar, '4', try5}
try5		pattern	{matchchar, '5', try6}
try6		pattern	{matchchar, '6', try7}
try7		pattern	{matchchar, '7', try8}
try8		pattern	{matchchar, '8', try9}
try9		pattern	{matchchar, '9'}

Obviously this isn't a very good solution because we can match this same pattern with the single statement:

Digits  pattern {anycset, digits}

If your pattern is easy to specify using a regular expression, you should try to encode it using the built-in pattern matching functions and fall back on the above algorithm once you've handled the low level patterns as best you can. With experience, you will be able to choose an appropriate balance between the algorithm in this section and ad hoc methods you develop on your own.


Chapter Sixteen (Part 6)	Table of Content	Chapter Sixteen (Part 8)

Chapter Sixteen: Pattern Matching (Part 7)
29 SEP 1996