CHAPTER FIFTEEN: STRINGS AND CHARACTER SETS (Part 6)

15.5 - The Character Set Routines in the UCR Standard Library
15.6 - Using the String Instructions on Other Data Types
15.6.1 - Multi-precision Integer Strings
15.6.2 - Dealing with Whole Arrays and Records

15.5 The Character Set Routines in the UCR Standard Library
The UCR Standard Library provides an extensive collection of character set routines. These routines let you create sets, clear sets (set them to the empty set), add and remove one or more items, test for set membership, copy sets, compute the union, intersection, or difference, and extract items from a set. Although intended to manipulate sets of characters, you can use the StdLib character set routines to manipulate any set with 256 or fewer possible items.
The first unusual thing to note about the StdLib's sets is their storage format. A 256-bit array would normally consume 32 consecutive bytes. For performance reasons, the UCR Standard Library's set format packs eight separate sets into 272 bytes (256 bytes for the eight sets plus 16 bytes overhead). To declare set variables in your data segment you should use the set macro. This macro takes the form:

                set     SetName1, SetName2, ..., SetName8

SetName1..SetName8 represent the names of up to eight set variables. You may have fewer than eight names in the operand field, but doing so will waste some bits in the set array.
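As a small illustration of the form above, the following sketch declares three set variables and then loads the address of one of them. The names Vowels, Operators, and Brackets are hypothetical; lesi is the StdLib macro this chapter uses elsewhere to load es:di with an address.

```asm
; Data segment: three hypothetical set variables. Only three of
; the eight available names are supplied, so the bits reserved
; for the other five sets go unused.
                set     Vowels, Operators, Brackets
                 .
                 .
                 .
; Code segment: point es:di at one of the sets so that a
; character set routine can operate on it.
                lesi    Vowels
```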
The CreateSets routine provides another mechanism for creating set variables. Unlike the set macro, which you would use to create set variables in your data segment, the CreateSets routine allocates storage for up to eight sets dynamically at run time. It returns a pointer to the first set variable in es:di. The remaining seven sets follow at locations es:di+1, es:di+2, ..., es:di+7. A typical program that allocates set variables dynamically might use the following code:
Set0            dword   ?
Set1            dword   ?
Set2            dword   ?
Set3            dword   ?
Set4            dword   ?
Set5            dword   ?
Set6            dword   ?
Set7            dword   ?
                 .
                 .
                 .
                CreateSets
                mov     word ptr Set0+2, es
                mov     word ptr Set1+2, es
                mov     word ptr Set2+2, es
                mov     word ptr Set3+2, es
                mov     word ptr Set4+2, es
                mov     word ptr Set5+2, es
                mov     word ptr Set6+2, es
                mov     word ptr Set7+2, es

                mov     word ptr Set0, di
                inc     di
                mov     word ptr Set1, di
                inc     di
                mov     word ptr Set2, di
                inc     di
                mov     word ptr Set3, di
                inc     di
                mov     word ptr Set4, di
                inc     di
                mov     word ptr Set5, di
                inc     di
                mov     word ptr Set6, di
                inc     di
                mov     word ptr Set7, di
                inc     di
This code segment creates eight different sets on the heap, all empty, and stores pointers to them in the appropriate pointer variables.
The SHELL.ASM file provides a commented-out line of code in the data segment that includes the file STDSETS.A. This include file provides the bit definitions for eight commonly used character sets. They are alpha (upper and lower case alphabetics), lower (lower case alphabetics), upper (upper case alphabetics), digits ("0".."9"), xdigits ("0".."9", "A".."F", and "a".."f"), alphanum (upper and lower case alphabetics plus the digits), whitespace (space, tab, carriage return, and line feed), and delimiters (whitespace plus commas, semicolons, less than, greater than, and vertical bar). If you would like to use these standard character sets in your program, you need to remove the semicolon from the beginning of the include statement in the SHELL.ASM file.
The UCR Standard Library provides 16 character set routines: CreateSets, EmptySet, RangeSet, AddStr, AddStrl, RmvStr, RmvStrl, AddChar, RmvChar, Member, CopySet, SetUnion, SetIntersect, SetDifference, NextItem, and RmvItem. All of these routines except CreateSets require a pointer to a character set variable in the es:di registers. Specific routines may require other parameters as well.
The EmptySet routine clears all the bits in a set, producing the empty set. This routine requires the address of the set variable in es:di. The following example clears the set pointed at by Set1:

                les     di, Set1
                EmptySet
RangeSet unions a range of values into the set variable pointed at by es:di. The al register contains the lower bound of the range of items; ah contains the upper bound. Note that al must be less than or equal to ah. The following example constructs the set of all control characters (ASCII codes one through 31; the null character [ASCII code zero] is not allowed in sets):

                les     di, CtrlCharSet ;Ptr to ctrl char set.
                mov     al, 1
                mov     ah, 31
                RangeSet
AddStr and AddStrl add all the characters in a zero terminated string to a character set. For AddStr, the dx:si register pair points at the zero terminated string. For AddStrl, the zero terminated string follows the call to AddStrl in the code stream. These routines union each character of the specified string into the set. The following examples add the digits and some special characters into the FPDigits set:

Digits          byte    "0123456789",0
                set     FPDigitsSet
FPDigits        dword   FPDigitsSet
                 .
                 .
                 .
                ldxi    Digits          ;Loads DX:SI with adrs of Digits.
                les     di, FPDigits
                AddStr
                 .
                 .
                 .
                les     di, FPDigits
                AddStrL
                byte    "Ee.+-",0
RmvStr and RmvStrl remove characters from a set. You supply the characters in a zero terminated string. For RmvStr, dx:si points at the string of characters to remove from the set. For RmvStrl, the zero terminated string follows the call. The following example uses RmvStrl to remove the special symbols from FPDigits above:

                les     di, FPDigits
                RmvStrl
                byte    "Ee.+-",0
The AddChar and RmvChar routines let you add or remove individual characters. As usual, es:di points at the set; the al register contains the character you wish to add to the set or remove from the set. The following example adds a space to the set FPDigits and removes the "," character (if present):

                les     di, FPDigits
                mov     al, ' '
                AddChar
                 .
                 .
                 .
                les     di, FPDigits
                mov     al, ','
                RmvChar
The Member function checks to see if a character is in a set. On entry, es:di must point at the set and al must contain the character to check. On exit, the zero flag is set if the character is a member of the set and clear if it is not. The following example reads characters from the keyboard until the user presses a key that is not a whitespace character:

SkipWS:         get                     ;Read char from user into AL.
                lesi    WhiteSpace      ;Address of WS set into es:di.
                member
                je      SkipWS
The CopySet, SetUnion, SetIntersect, and SetDifference routines all operate on two sets of characters. The es:di register pair points at the destination character set; the dx:si register pair points at the source character set. CopySet copies the bits from the source set to the destination set, replacing the original bits in the destination set. SetUnion computes the union of the two sets and stores the result into the destination set. SetIntersect computes the set intersection and stores the result into the destination set. Finally, the SetDifference routine computes DestSet := DestSet - SrcSet.
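The following sketch shows one way these two-operand routines might be combined. It uses only routines and macros that appear in this chapter (EmptySet, RangeSet, CopySet, SetUnion, SetDifference, lesi, ldxi). The set names Letters, DecDigs, and IdChars are hypothetical. The sketch assumes that ldxi accepts a set name in the same way it accepts a string name, and it reloads es:di before every call rather than assuming the routines preserve it.

```asm
; Hypothetical set variables, declared in the data segment.
                set     Letters, DecDigs, IdChars
                 .
                 .
                 .
; Letters := {'A'..'Z', 'a'..'z'}
                lesi    Letters
                EmptySet
                lesi    Letters
                mov     al, 'A'
                mov     ah, 'Z'
                RangeSet
                lesi    Letters
                mov     al, 'a'
                mov     ah, 'z'
                RangeSet

; DecDigs := {'0'..'9'}
                lesi    DecDigs
                EmptySet
                lesi    DecDigs
                mov     al, '0'
                mov     ah, '9'
                RangeSet

; IdChars := Letters + DecDigs
                lesi    IdChars         ;es:di -> destination set.
                ldxi    Letters         ;dx:si -> source set.
                CopySet                 ;IdChars := Letters
                lesi    IdChars
                ldxi    DecDigs
                SetUnion                ;IdChars := IdChars + DecDigs

; IdChars := IdChars - DecDigs, which leaves only the letters.
                lesi    IdChars
                ldxi    DecDigs
                SetDifference
```

In every call the destination is also the left-hand operand, so the source set is never modified.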
The NextItem and RmvItem routines let you extract elements from a set. NextItem returns in al the ASCII code of the first character it finds in a set. RmvItem does the same thing except it also removes the character from the set. These routines return zero in al if the set is empty (StdLib sets cannot contain the NULL character). You can use the RmvItem routine to build a rudimentary iterator for a character set.
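A rudimentary iterator of this kind might look like the sketch below, which counts the elements of a set. Because RmvItem destroys the set it walks, the loop works on a copy. The names Source, Scratch, and Count are hypothetical, and the count lives in memory rather than in a register because this chapter does not say which registers RmvItem preserves.

```asm
; Hypothetical variables, declared in the data segment.
                set     Source, Scratch
Count           word    0
                 .
                 .
                 .
                lesi    Scratch         ;es:di -> working copy.
                ldxi    Source          ;dx:si -> set to enumerate.
                CopySet                 ;Scratch := Source
                mov     Count, 0

NextElem:       lesi    Scratch
                RmvItem                 ;al := an element, now removed.
                cmp     al, 0           ;Zero means Scratch is empty.
                je      AllDone
                inc     Count           ;"Process" the element in al here.
                jmp     NextElem
AllDone:                                ;Count holds the number of elements.
```

Replacing the inc instruction with any other use of al turns the loop into a general "for each element" construct.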
The UCR Standard Library's character set routines are very powerful. With them, you can easily manipulate character string data, especially when searching for different patterns within a string. We will consider these routines again when we study pattern matching later in this text.
15.6 Using the String Instructions on Other Data Types

The string instructions work with other data types besides character strings. You can use the string instructions to copy whole arrays from one variable to another, to initialize large data structures to a single value, or to compare entire data structures for equality or inequality. Anytime you're dealing with data structures containing several bytes, you may be able to use the string instructions.
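As a minimal sketch of the initialization case, the following code fills a table of words with a single value using stos. The name Table, its size, and the fill value are hypothetical; the code also assumes that ds already addresses the data segment holding Table.

```asm
; Hypothetical table, declared in the data segment.
Table           word    256 dup (?)
                 .
                 .
                 .
                mov     ax, ds          ;stos writes through es:di, so make
                mov     es, ax          ; es address the data segment too.
                lea     di, Table
                mov     cx, 256         ;Number of words to fill.
                mov     ax, 0FFFFh      ;Value stored into each word.
                cld                     ;Process from low to high addresses.
        rep     stosw
```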
15.6.1 Multi-precision Integer Strings
The cmps instruction is useful for comparing (very) large integer values. Unlike character strings, we cannot compare integers with cmps from the L.O. byte through the H.O. byte. Instead, we must compare them from the H.O. byte down to the L.O. byte. The following code compares two 12-byte integers:

                lea     di, integer1+10 ;Point at the H.O. words.
                lea     si, integer2+10
                mov     cx, 6           ;12 bytes = six words.
                std                     ;Work from H.O. down to L.O.
        repe    cmpsw

After the execution of the cmpsw instruction, the flags will contain the result of the comparison. Because cmps subtracts the es:di operand from the ds:si operand, the flags describe integer2 relative to integer1.
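The sketch below extends the comparison with the branches that might follow it, treating the two values as unsigned. The labels Int2Bigger and Int2Smaller are hypothetical, and the code assumes es and ds both address the segment containing integer1 and integer2. Note that cmps computes the ds:si operand minus the es:di operand, so the flags describe integer2 relative to integer1.

```asm
                lea     di, integer1+10 ;es:di -> H.O. word of integer1.
                lea     si, integer2+10 ;ds:si -> H.O. word of integer2.
                mov     cx, 6
                std
        repe    cmpsw                   ;Stops at the first unequal pair.
                cld                     ;Restore the usual direction; cld
                                        ; changes no condition flags.
                ja      Int2Bigger      ;integer2 > integer1 (unsigned).
                jb      Int2Smaller     ;integer2 < integer1 (unsigned).
; Fall through: all six words matched, so the values are equal.
```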
You can easily assign one long integer string to another using the movs instruction. Nothing tricky here, just load up the si, di, and cx registers and have at it. You must do other operations, including arithmetic and logical operations, using the extended precision methods described in the chapter on arithmetic operations.
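For the 12-byte integers used in the comparison above, such an assignment might be sketched as follows. The code assumes es and ds both address the segment containing the two variables.

```asm
                cld                     ;Copy from low to high addresses.
                lea     si, integer1    ;ds:si -> source.
                lea     di, integer2    ;es:di -> destination.
                mov     cx, 6           ;12 bytes = six words.
        rep     movsw                   ;integer2 := integer1
```

Unlike the comparison, the direction of the copy does not matter here because the two variables do not overlap.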
15.6.2 Dealing with Whole Arrays and Records
The only operations that apply, in general, to all array and record structures are assignment and comparison (for equality/inequality only). You can use the movs and cmps instructions for these operations.
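An equality test on two arrays might be sketched as follows. The names Array1, Array2, and NotEqual and the element count of 100 are hypothetical, and the code assumes es and ds both address the data segment.

```asm
; Hypothetical arrays, declared in the data segment.
Array1          word    100 dup (?)
Array2          word    100 dup (?)
                 .
                 .
                 .
                cld
                lea     si, Array1
                lea     di, Array2
                mov     cx, 100         ;Number of words to compare.
        repe    cmpsw                   ;Stops at the first mismatch.
                jne     NotEqual        ;Some pair of elements differed.
; Fall through: every element matched, so the arrays are equal.
```

Only the zero flag is meaningful here; for arrays and records in general there is no sensible "less than" or "greater than" result.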
Operations such as scalar addition, transposition, etc., may be easily synthesized using the lods and stos instructions. The following code shows how you can easily add the value 20 to each element of the integer array A:

                lea     si, A
                mov     di, si          ;Store back over the same array.
                mov     cx, SizeOfA     ;Number of elements (words) in A.
                cld
AddLoop:        lodsw
                add     ax, 20
                stosw
                loop    AddLoop
You can implement other operations in a similar fashion.
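For instance, negating every element of the same array might be sketched as follows. The labels NegLoop and NegDone are hypothetical, and the code assumes, as the addition loop does, that SizeOfA is the number of word elements in A and that es and ds address the same segment.

```asm
                lea     si, A
                mov     di, si          ;Store back over the same array.
                mov     cx, SizeOfA     ;Number of elements (words) in A.
                jcxz    NegDone         ;Skip the loop for an empty array.
                cld
NegLoop:        lodsw                   ;ax := next element.
                neg     ax              ;ax := -ax
                stosw                   ;Write the result back.
                loop    NegLoop
NegDone:
```

The jcxz guard matters because loop decrements cx before testing it, so entering the loop with cx equal to zero would process 65,536 elements.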
Chapter Fifteen: Strings And
Character Sets (Part 6)
28 SEP 1996