CHAPTER SIXTEEN: PATTERN MATCHING (Part 5)

The Art of ASSEMBLY LANGUAGE PROGRAMMING

Chapter Sixteen (Part 4)	Table of Content	Chapter Sixteen (Part 6)


CHAPTER SIXTEEN: PATTERN MATCHING (Part 5)

16.2 - The UCR Standard Library Pattern Matching Routines 16.3 - The Standard Library Pattern Matching Functions 16.3.1 - Spancset 16.3.2 - Brkcset 16.3.3 - Anycset 16.3.4 - Notanycset 16.3.5 - MatchStr 16.3.6 - MatchiStr 16.3.7 - MatchToStr 16.3.8 - MatchChar 16.3.9 - MatchToChar	16.3.10 - MatchChars 16.3.11 - MatchToPat 16.3.12 - EOS 16.3.13 - ARB 16.3.14 - ARBNUM 16.3.15 - Skip 16.3.16 - Pos 16.3.17 - RPos 16.3.18 - GotoPos 16.3.19 - RGotoPos 16.3.20 - SL_Match2

16.2 The UCR Standard Library Pattern Matching Routines

The UCR Standard Library provides a very sophisticated set of pattern matching routines. They are patterned after the pattern matching facilities of SNOBOL4, support CFGs, and provide fully automatic backtracking, as necessary. Furthermore, by writing only five assembly language statements, you can match simple or complex patterns.

There is very little assembly language code to worry about when using the Standard Library's pattern matching routines because most of the work occurs in the data segment. To use the pattern matching routines, you first construct a pattern data structure in the data segment. You then pass the address of this pattern and the string you wish to test to the Standard Library match routine. The match routine returns failure or success depending on the state of the comparison. This isn't quite as easy as it sounds, though; learning how to construct the pattern data structure is almost like learning a new programming language. Fortunately, if you've followed the discussion on context free languages, learning this new "language" is a breeze.

The Standard Library pattern data structure takes the following form:

Pattern         struct
MatchFunction   dword   ?
MatchParm       dword   ?
MatchAlt        dword   ?
NextPattern     dword   ?
EndPattern      word    ?
StartPattern    word    ?
StrSeg          word    ?
Pattern         ends

The MatchFunction field contains the address of a routine to call to perform some sort of comparison. The success or failure of this function determines whether the pattern matches the input string. For example, the UCR Standard Library provides a MatchStr function that compares the next n characters of the input string against some other character string.

The MatchParm field contains the address or value of a parameter (if appropriate) for the MatchFunction routine. For example, if the MatchFunction routine is MatchStr, then the MatchParm field contains the address of the string to compare the input characters against. Likewise, the MatchChar routine compares the next input character in the string against the L.O. byte of the MatchParm field. Some matching functions do not require any parameters, they will ignore any value you assign to MatchParm field. By convention, most programmers store a zero in unused fields of the Pattern structure.

The MatchAlt field contains either zero (NULL) or the address of some other pattern data structure. If the current pattern matches the input characters, the pattern matching routines ignore this field. However, if the current pattern fails to match the input string, then the pattern matching routines will attempt to match the pattern whose address appears in this field. If this alternate pattern returns success, then the pattern matching routine returns success to the caller, otherwise it returns failure. If the MatchAlt field contains NULL, then the pattern matching routine immediately fails if the main pattern does not match.

The Pattern data structure only matches one item. For example, it might match a single character, a single string, or a character from a set of characters. A real world pattern will probably contain several small patterns concatenated together, e.g., the pattern for a Pascal identifier consists of a single character from the set of alphabetic characters followed by one or more characters from the set [a-zA-Z0-9_]. The NextPattern field lets you create a composite pattern as the concatenation of two individual patterns. For such a composite pattern to return success, the current pattern must match and then the pattern specified by the NextPattern field must also match. Note that you can chain as many patterns together as you please using this field.

The last three fields, EndPattern, StartPattern, and StrSeg are for the internal use of the pattern matching routine. You should not modify or examine these fields.

Once you create a pattern, it is very easy to test a string to see if it matches that pattern. The calling sequence for the UCR Standard Library match routine is

                lesi    << Input string to match »
                ldxi    << Pattern to match string against »
                mov     cx, 0
                match
                jc      Success

The Standard Library match routine expects a pointer to the input string in the es:di registers; it expects a pointer to the pattern you want to match in the dx:si register pair. The cx register should contain the length of the string you want to test. If cx contains zero, the match routine will test the entire input string. If cx contains a nonzero value, the match routine will only test the first cx characters in the string. Note that the end of the string (the zero terminating byte) must not appear in the string before the position specified in cx. For most applications, loading cx with zero before calling match is the most appropriate operation.

On return from the match routine, the carry flag denotes success or failure. If the carry flag is set, the pattern matches the string; if the carry flag is clear, the pattern does not match the string. Unlike the examples given in earlier sections, the match routine does not modify the di register, even if the match succeeds. Instead, it returns the failure/success position in the ax register. The is the position of the first character after the match if match succeeds, it is the position of the first unmatched character if match fails.

16.3 The Standard Library Pattern Matching Functions

The UCR Standard Library provides about 20 built-in pattern matching functions. These functions are based on the pattern matching facilities provided by the SNOBOL4 programming language, so they are very powerful indeed! You will probably discover that these routines solve all your pattern matching need, although it is easy to write your own pattern matching routines if an appropriate one is not available. The following subsections describe each of these pattern matching routines in detail.

There are two things you should note if you're using the Standard Library's SHELL.ASM file when creating programs that use pattern matching and character sets. First, there is a line at the very beginning of the SHELL.ASM file that contains the statement "matchfuncs". This line is currently a comment because it contains a semicolon in column one. If you are going to be using the pattern matching facilities of the UCR Standard Library, you need to uncomment this line by deleting the semicolon in column one. If you are going to be using the character set facilities of the UCR Standard Library (very common when using the pattern matching facilities), you may want to uncomment the line containing "include stdsets.a" in the data segment. The "stdsets.a" file includes several common character sets, including alphabetics, digits, alphanumerics, whitespace, and so on.

16.3.1 Spancset

The spancset routine skips over all characters belonging to a character set. This routine will match zero or more characters in the specified set and, therefore, always succeeds. The MatchParm field of the pattern data structure must point at a UCR Standard Library character set variable.

Example:

SkipAlphas      pattern {spancset, alpha}
                 .
                 .
                 .
                lesi    StringWAlphas
                ldxi    SkipAlphas
                xor     cx, cx
                match

16.3.2 Brkcset

Brkcset is the dual to spancset - it matches zero or more characters in the input string which are not members of a specified character set. Another way of viewing brkcset is that it will match all characters in the input string up to a character in the specified character set (or to the end of the string). The matchparm field contains the address of the character set to match.

Example:

DoDigits        pattern {brkcset, digits, 0, DoDigits2}
DoDigits2       pattern {spancset, digits}
                 .
                 .
                 .
                lesi    StringWDigits
                ldxi    DoDigits
                xor     cx, cx
                match
                jnc     NoDigits

The code above matches any string that contains a string of one or more digits somewhere in the string.

16.3.3 Anycset

Anycset matches a single character in the input string from a set of characters. The matchparm field contains the address of a character set variable. If the next character in the input string is a member of this set, anycset set accepts the string and skips over than character. If the next input character is not a member of that set, anycset returns failure.

Example:

DoID            pattern {anycset, alpha, 0, DoID2}
DoID2           pattern {spancset, alphanum}
                 .
                 .
                 .
                lesi    StringWID
                ldxi    DoID
                xor     cx, cx
                match
                jnc     NoID

This code segment checks the string StringWID to see if it begins with an identifier specified by the regular expression [a-zA-Z][a-zA-Z0-9]*. The first subpattern with anycset makes sure there is an alphabetic character at the beginning of the string (alpha is the stdsets.a set variable that has all the alphabetic characters as members). If the string does not begin with an alphabetic, the DoID pattern fails. The second subpattern, DoID2, skips over any following alphanumeric characters using the spancset matching function. Note that spancset always succeeds.

The above code does not simply match a string that is an identifier; it matches strings that begin with a valid identifier. For example, it would match "ThisIsAnID" as well as "ThisIsAnID+SoIsThis - 5". If you only want to match a single identifier and nothing else, you must explicitly check for the end of string in your pattern.

16.3.4 Notanycset

Notanycset provides the complement to anycset - it matches a single character in the input string that is not a member of a character set. The matchparm field, as usual, contains the address of the character set whose members must not appear as the next character in the input string. If notanycset successfully matches a character (that is, the next input character is not in the designated character set), the function skips the character and returns success; otherwise it returns failure.

Example:

DoSpecial       pattern {notanycset, digits, 0, DoSpecial2}
DoSpecial2      pattern {spancset, alphanum}
                 .
                 .
                 .
                lesi    StringWSpecial
                ldxi    DoSpecial
                xor     cx, cx
                match
                jnc     NoSpecial

This code is similar to the DoID pattern in the previous example. It matches a string containing any character except a digit and then matches a string of alphanumeric characters.

16.3.5 MatchStr

Matchstr compares the next set of input characters against a character string. The matchparm field contains the address of a zero terminated string to compare against. If matchstr succeeds, it returns the carry set and skips over the characters it matched; if it fails, it tries the alternate matching function or returns failure if there is no alternate.

Example:

DoString        pattern {matchstr, MyStr}
MyStr           byte    "Match this!",0
                 .
                 .
                 .
                lesi    String
                ldxi    DoString
                xor     cx, cx
                match
                jnc     NotMatchThis

This sample code matches any string that begins with the characters "Match This!"

16.3.6 MatchiStr

Matchistr is like matchstr insofar as it compares the next several characters against a zero terminated string value. However, matchistr does a case insensitive comparison. During the comparison it converts the characters in the input string to upper case before comparing them to the characters that the matchparm field points at. Therefore, the string pointed at by the matchparm field must contain uppercase wherever alphabetics appear. If the matchparm string contains any lower case characters, the matchistr function will always fail.

Example:

DoString        pattern {matchistr, MyStr}
MyStr           byte    "MATCH THIS!",0
                 .
                 .
                 .
                lesi    String
                ldxi    DoString
                xor     cx, cx
                match
                jnc     NotMatchThis

This example is identical to the one in the previous section except it will match the characters "match this!" using any combination of upper and lower case characters.

16.3.7 MatchToStr

Matchtostr matches all characters in an input string up to and including the characters specified by the matchparm parameter. This routine succeeds if the specified string appears somewhere in the input string, it fails if the string does not appear in the input string. This pattern function is quite useful for locating a substring and ignoring everything that came before the substring.

Example:

DoString        pattern {matchtostr, MyStr}
MyStr           byte    "Match this!",0
                 .
                 .
                 .
                lesi    String
                ldxi    DoString
                xor     cx, cx
                match
                jnc     NotMatchThis

Like the previous two examples, this code segment matches the string "Match this!" However, it does not require that the input string (String) begin with "Match this!" Instead, it only requires that "Match this!" appear somewhere in the string.

16.3.8 MatchChar

The matchchar function matches a single character. The matchparm field's L.O. byte contains the character you want to match. If the next character in the input string is that character, then this function succeeds, otherwise it fails.

Example:

DoSpace         pattern {matchchar, ' '}
                 .
                 .
                 .
                lesi    String
                ldxi    DoSpace
                xor     cx, cx
                match
                jnc     NoSpace

This code segment matches any string that begins with a space. Keep in mind that the match routine only checks the prefix of a string. If you wanted to see if the string contained only a space (rather than a string that begins with a space), you would need to explicitly check for an end of string after the space. Of course, it would be far more efficient to use strcmp rather than match for this purpose!

Note that unlike matchstr, you encode the character you want to match directly into the matchparm field. This lets you specify the character you want to test directly in the pattern definition.

16.3.9 MatchToChar

Like matchtostr, matchtochar matches all characters up to and including a character you specify. This is similar to brkcset except you don't have to create a character set containing a single member and brkcset skips up to but not including the specified character(s). Matchtochar fails if it cannot find the specified character in the input string.

Example:

DoToSpace       pattern {matchtochar, ' '}
                 .
                 .
                 .
                lesi    String
                ldxi    DoSpace
                xor     cx, cx
                match
                jnc     NoSpace

This call to match will fail if there are no spaces left in the input string. If there are, the call to matchtochar will skip over all characters up to, and including, the first space. This is a useful pattern for skipping over words in a string.

16.3.10 MatchChars

Matchchars skips zero or more occurrences of a singe character in an input string. It is similar to spancset except you can specify a single character rather than an entire character set with a single member. Like matchchar, matchchars expects a single character in the L.O. byte of the matchparm field. Since this routine matches zero or more occurrences of that character, it always succeeds.

Example:

Skip2NextWord   pattern {matchtochar, ' ', 0, SkipSpcs}
SkipSpcs        pattern {matchchars, ' '}
                 .
                 .
                 .
                lesi    String
                ldxi    Skip2NextWord
                xor     cx, cx
                match
                jnc     NoWord

The code segment skips to the beginning of the next word in a string. It fails if there are no additional words in the string (i.e., the string contains no spaces).

16.3.11 MatchToPat

Matchtopat matches all characters in a string up to and including the substring matched by some other pattern. This is one of the two facilities the UCR Standard Library pattern matching routines provide to allow the implementation of nonterminal function calls. This matching function succeeds if it finds a string matching the specified pattern somewhere on the line. If it succeeds, it skips the characters through the last character matched by the pattern parameter. As you would expect, the matchparm field contains the address of the pattern to match.

Example:

; Assume there is a pattern "expression" that matches arithmetic
; expressions. The following pattern determines if there is such an
; expression on the line followed by a semicolon.

FindExp         pattern {matchtopat, expression, 0, MatchSemi}
MatchSemi       pattern {matchchar, ';'}
                 .
                 .
                 .
                lesi    String
                ldxi    FindExp
                xor     cx, cx
                match
                jnc     NoExp

16.3.12 EOS

The EOS pattern matches the end of a string. This pattern, which must obviously appear at the end of a pattern list if it appears at all, checks for the zero terminating byte. Since the Standard Library routines only match prefixes, you should stick this pattern at the end of a list if you want to ensure that a pattern exactly matches a string with no left over characters at the end. EOS succeeds if it matches the zero terminating byte, it fails otherwise.

Example:

SkipNumber      pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits, 0, EOSPat}
EOSPat          pattern {EOS}
                 .
                 .
                 .
                lesi    String
                ldxi    SkipNumber
                xor     cx, cx
                match
                jnc     NoNumber

The SkipNumber pattern matches strings that contain only decimal digits (from the start of the match to the end of the string). Note that EOS requires no parameters, not even a matchparm parameter.

16.3.13 ARB

ARB matches any number of arbitrary characters. This pattern matching function is equivalent to *. Note that ARB is a very inefficient routine to use. It works by assuming it can match all remaining characters in the string and then tries to match the pattern specified by the nextpattern field. If the nextpattern item fails, ARB backs up one character and tries matching nextpattern again. This continues until the pattern specified by nextpattern succeeds or ARB backs up to its initial starting position. ARB succeeds if the pattern specified by nextpattern succeeds, it fails if it backs up to its initial starting position.

Given the enormous amount of backtracking that can occur with ARB (especially on long strings), you should try to avoid using this pattern if at all possible. The matchtostr, matchtochar, and matchtopat functions accomplish much of what ARB accomplishes, but they work forward rather than backward in the source string and may be more efficient. ARB is useful mainly if you're sure the following pattern appears late in the string you're matching or if the string you want to match occurs several times and you want to match the last occurrence (matchtostr, matchtochar, and matchtopat always match the first occurrence they find).

Example:

SkipNumber      pattern {ARB,0,0,SkipDigit}
SkipDigit       pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits}
                 .
                 .
                 .
                lesi    String
                ldxi    SkipNumber
                xor     cx, cx
                match
                jnc     NoNumber

This code example matches the last number that appears on an input line. Note that ARB does not use the matchparm field, so you should set it to zero by default.

16.3.14 ARBNUM

ARBNUM matches an arbitrary number (zero or more) of patterns that occur in the input string. If R represents some nonterminal number (pattern matching function), then ARBNUM(R ) is equivalent to the production ARBNUM R ARBNUM | .

The matchparm field contains the address of the pattern that ARBNUM attempts to match.

Example:

SkipNumbers     pattern {ARBNUM, SkipNumber}
SkipNumber      pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits, 0, EndDigits}
EndDigits       pattern {matchchars, ' ', EndString}
EndString       pattern {EOS}
                 .
                 .
                 .
                lesi    String
                ldxi    SkipNumbers
                xor     cx, cx
                match
                jnc     IllegalNumbers

This code accepts the input string if it consists of a sequence of zero or more numbers separated by spaces and terminated with the EOS pattern. Note the use of the matchalt field in the EndDigits pattern to select EOS rather than a space for the last number in the string.

16.3.15 Skip

Skip matches n arbitrary characters in the input string. The matchparm field is an integer value containing the number of characters to skip. Although the matchparm field is a double word, this routine limits the number of characters you can skip to 16 bits (65,535 characters); that is, n is the L.O. word of the matchparm field. This should prove sufficient for most needs.

Skip succeeds if there are at least n characters left in the input string; it fails if there are fewer than n characters left in the input string.

Example:

Skip1st6        pattern {skip, 6, 0, SkipNumber}
SkipNumber      pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits, 0, EndDigits}
EndDigits       pattern {EOS}
                 .
                 .
                 .
                lesi    String
                ldxi    Skip1st6
                xor     cx, cx
                match
                jnc     IllegalItem

This example matches a string containing six arbitrary characters followed by one or more decimal digits and a zero terminating byte.

16.3.16 Pos

Pos succeeds if the matching functions are currently at the nth character in the string, where n is the value in the L.O. word of the matchparm field. Pos fails if the matching functions are not currently at position n in the string. Unlike the pattern matching functions you've seen so far, pos does not consume any input characters. Note that the string starts out at position zero. So when you use the pos function, it succeeds if you've matched n characters at that point.

Example:

SkipNumber      pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits, 0, EndDigits}
EndDigits       pattern {pos, 4}
                 .
                 .
                 .
                lesi    String
                ldxi    SkipNumber
                xor     cx, cx
                match
                jnc     IllegalItem

This code matches a string that begins with exactly 4 decimal digits.

16.3.17 RPos

Rpos works quite a bit like the pos function except it succeeds if the current position is n character positions from the end of the string. Like pos, n is the L.O. 16 bits of the matchparm field. Also like pos, rpos does not consume any input characters.

Example:

SkipNumber      pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits, 0, EndDigits}
EndDigits       pattern {rpos, 4}
                 .
                 .
                 .
                lesi    String
                ldxi    SkipNumber
                xor     cx, cx
                match
                jnc     IllegalItem

This code matches any string that is all decimal digits except for the last four characters of the string. The string must be at least five characters long for the above pattern match to succeed.

16.3.18 GotoPos

Gotopos skips over any characters in the string until it reaches character position n in the string. This function fails if the pattern is already beyond position n in the string. The L.O. word of the matchparm field contains the value for n.

Example:

SkipNumber      pattern {gotopos, 10, 0, MatchNmbr}
MatchNmbr       pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits, 0, EndDigits}
EndDigits       pattern {rpos, 4}
                 .
                 .
                 .
                lesi    String
                ldxi    SkipNumber
                xor     cx, cx
                match
                jnc     IllegalItem

This example code skips to position 10 in the string and attempts to match a string of digits starting with the 11th character. This pattern succeeds if the there are four characters remaining in the string after processing all the digits.

16.3.19 RGotoPos

Rgotopos works like gotopos except it goes to the position specified from the end of the string. Rgotopos fails if the matching routines are already beyond position n from the end of the string. As with gotopos, the L.O. word of the matchparm field contains the value for n.

Example:

SkipNumber      pattern {rgotopos, 10, 0, MatchNmbr}
MatchNmbr       pattern {anycset, digits, 0, SkipDigits}
SkipDigits      pattern {spancset, digits}
                 .
                 .
                 .
                lesi    String
                ldxi    SkipNumber
                xor     cx, cx
                match
                jnc     IllegalItem

This example skips to ten characters from the end of the string and then attempts to match one or digits starting at that point. It fails if there aren't at least 11 characters in the string or the last 10 characters don't begin with a string of one or more digits.

16.3.20 SL_Match2

The sl_match2 routine is nothing more than a recursive call to match. The matchparm field contains the address of pattern to match. This is quite useful for simulating parenthesis around a pattern in a pattern expression. As far as matching strings are concerned, pattern1 and pattern2, below, are equivalent:

Pattern2		pattern	{sl_match2, Pattern1}
Pattern1		pattern	{matchchar, 'a'}

The only difference between invoking a pattern directly and invoking it with sl_match2 is that sl_match2 tweaks some internal variables to keep track of matching positions within the input string. Later, you can extract the character string matched by sl_match2 using the patgrab routine.


Chapter Sixteen (Part 4)	Table of Content	Chapter Sixteen (Part 6)

Chapter Sixteen: Pattern Matching (Part 5)
29 SEP 1996