Tutorial 05: TText - Advanced Operations
Overview
This tutorial covers advanced TText operations that make text manipulation powerful and efficient. You'll learn about:
KMP Search Algorithm - Fast substring searching with O(n+m) complexity
UTF-8 Character Handling - Working with multi-byte Unicode characters
Coordinate Conversion - Converting between different text position systems
These operations are essential for building text editors, search functionality, and handling international text.
Topics Covered
1. KMP Search Algorithm
Understanding the Knuth-Morris-Pratt algorithm
Preparing search patterns efficiently
Finding all occurrences of a substring
Case-sensitive and case-insensitive search
2. UTF-8 Character Handling
UTF-8 encoding basics
Converting between byte positions and character indices
Working with multi-byte characters
Handling international text correctly
3. Coordinate Conversion
Three coordinate systems in TText
Converting between offset, position, and index
Understanding when to use each coordinate system
Practical examples of conversion
Prerequisites
Before starting this tutorial, you should have completed:
Tutorial 04: TText - Gap Buffer Basics
Understanding of TText structure and basic operations
Knowledge of UTF-8 encoding (helpful but not required)
Demos
| Demo | Filename | Functions Covered |
| 16 | demo16_ttext_search.asm | TextPrepareSearch, TextSearch |
| 17 | demo17_ttext_unicode.asm | TextIndexToPos, TextPosToIndex |
| 18 | demo18_ttext_coords.asm | TextPosToOffset, TextOffsetToPos |
Function Reference
Search Functions
TextPrepareSearch - Preprocess pattern for KMP search
TextSearch - Find substring using KMP algorithm
UTF-8 Functions
TextIndexToPos - Convert character index to byte position
TextPosToIndex - Convert byte position to character index
Coordinate Conversion Functions
TextPosToOffset - Convert position to offset (including gap)
TextOffsetToPos - Convert offset to position (excluding gap)
Building and Running
cd 05-ttext-advanced
./build.sh
This will compile and test all 3 demos.
Important Discoveries During Implementation
1. Bash Variable SECONDS Conflicts with Build Script Parsing
Problem: The build script showed arithmetic syntax errors like invalid arithmetic operator: error token is ".1"
Discovery: The script stored FASM's timing output in a variable named SECONDS, which is a bash built-in. Bash treats SECONDS as an integer timer, so assigning a fractional value such as 0.1 is evaluated in an arithmetic context and fails.
Solution: Rename the variable to avoid conflict:
# WRONG - conflicts with bash built-in:
SECONDS=$(echo "$OUTPUT" | grep "passes" | awk '{print $3}')
# CORRECT - use different name:
TIME_VAL=$(echo "$OUTPUT" | grep "passes" | awk '{print $3}')
2. FASM Angle Bracket Strings with Quotes Cause Parse Errors
Problem: Inline strings like <" ['> or <"']"> cause "missing end quote" errors in FASM.
Discovery: The FASM preprocessor has trouble with double-quotes inside angle brackets when certain quote/bracket combinations appear. The parser gets confused about string boundaries.
Solutions (in order of preference):
Option 1: Use single quotes as outer delimiter (BEST - simplest):
; WRONG - causes parse error:
stdcall FileWriteString, [STDOUT], <" ['>
; CORRECT - swap to single quotes:
stdcall FileWriteString, [STDOUT], <' ['>
stdcall FileWriteString, [STDOUT], <'] '>
Option 2: Define constants in iglobal (good for reusable strings):
iglobal
cQuoteOpen text " ['"
cQuoteClose text "']"
endg
stdcall FileWriteString, [STDOUT], cQuoteOpen
stdcall FileWriteString, [STDOUT], cQuoteClose
Option 3: Mixed quote types (for complex cases):
; Mix quote types safely:
stdcall FileWriteString, [STDOUT], <'"Hello"'>
Pattern:
Use single quotes (<'text'>) when the string contains double quotes
Use double quotes (<"text">) when the string contains single quotes
Define constants in iglobal for complex or reusable strings
Never put the outer delimiter's quote type inside a <> string
3. NumToStr Requires 32-bit Register, Not 8-bit
Problem: Code like stdcall NumToStr, al, ntsHex expands to pushd al, which FASM rejects with "invalid size of operand".
Discovery: NumToStr is called through the stdcall macro, which pushes each argument onto the stack. You cannot push 8-bit registers (AL, BL, CL, DL).
Solution: Always zero-extend 8-bit values to 32-bit before calling NumToStr:
; WRONG - tries to push 8-bit register:
stdcall NumToStr, al, ntsHex
; CORRECT - extend to 32-bit first:
movzx eax, al ; Convert 8-bit to 32-bit
stdcall NumToStr, eax, ntsHex
Why This Matters: The stdcall macro expands to push arguments, and x86 can only push 16-bit or 32-bit values, not 8-bit.
4. TextPrepareSearch Returns Memory That Must Be Freed
Problem: Forgetting to free the index table returned by TextPrepareSearch causes memory leaks.
Discovery: TextPrepareSearch allocates memory for the KMP prefix table and returns a pointer in EAX. This must be freed with FreeMem when done.
Solution: Always track and free the index table:
stdcall TextPrepareSearch, [hSearch], tsfCaseSensitive
mov [pIndexTable], eax
test eax, eax
jz .error
; ... use the table for searches ...
; CRITICAL - free when done:
stdcall FreeMem, [pIndexTable]
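To make the role of the index table concrete, here is a minimal sketch of a KMP prefix table and a find-all search. It is plain Python rather than FreshLib code, and the names build_prefix_table and find_all are hypothetical; the actual layout of the table that TextPrepareSearch allocates may differ.

```python
def build_prefix_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the classic KMP failure table)."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def find_all(text, pattern, table):
    """Return the start position of every (possibly overlapping) match.
    Assumes a non-empty pattern and a table built from that same pattern."""
    hits = []
    k = 0  # number of pattern bytes matched so far
    for i, byte in enumerate(text):
        while k > 0 and byte != pattern[k]:
            k = table[k - 1]
        if byte == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return hits


table = build_prefix_table(b"abab")
print(table)                                  # [0, 0, 1, 2]
print(find_all(b"abababab", b"abab", table))  # [0, 2, 4]
```

The table is built once per pattern and reused for every search, which is why it lives in separately allocated memory that the caller owns.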
5. TextSearch Requires Matching Flags Between Prepare and Search
Problem: Searching with different flags than used in TextPrepareSearch gives incorrect results.
Discovery: The KMP prefix table is built based on case-sensitivity settings. With mismatched flags the table no longer agrees with the comparisons made during the search, so the results are unreliable.
Solution: Use the same flags consistently:
; Prepare with case-insensitive:
stdcall TextPrepareSearch, [hSearch], tsfCaseIgnore
mov [pIndexTable], eax
; Search MUST also use case-insensitive:
stdcall TextSearch, [pText], [hSearch], 0, [pIndexTable], tsfCaseIgnore
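The effect of mismatched flags can be reproduced with a small Python model. This is a sketch under assumptions, not the FreshLib implementation: prefix_table and kmp_find are hypothetical names, and case folding is modelled with str.lower() on ASCII text.

```python
def prefix_table(pattern, fold):
    """KMP prefix table; with fold=True the pattern is case-folded first."""
    p = pattern.lower() if fold else pattern
    table = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = table[k - 1]
        if p[i] == p[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text, pattern, table, fold):
    """First match position or -1; fold controls how characters are compared."""
    t = text.lower() if fold else text
    p = pattern.lower() if fold else pattern
    k = 0
    for i, ch in enumerate(t):
        while k > 0 and ch != p[k]:
            k = table[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return i - k + 1
    return -1


text, pattern = "aAAb", "aAb"
folded = prefix_table(pattern, fold=True)    # [0, 1, 0]
exact = prefix_table(pattern, fold=False)    # [0, 0, 0]

print(kmp_find(text, pattern, exact, fold=False))   # -1: no exact match exists
print(kmp_find(text, pattern, folded, fold=True))   # 1: "AAb" matches ignoring case
print(kmp_find(text, pattern, folded, fold=False))  # 1: bogus, "AAb" != "aAb"
```

In the last call the table claims that "A" and "a" are interchangeable while the comparison loop disagrees, so the search reports a match that is not really there.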
6. TextIndexToPos and TextPosToIndex Handle UTF-8 Correctly
Discovery: These functions automatically account for UTF-8 multi-byte characters. You don't need to manually count UTF-8 bytes.
Example:
; For text "H€llö" (8 bytes, 5 characters: 'H'=1, '€'=3, 'l'=1, 'l'=1, 'ö'=2 bytes),
; assuming zero-based indices and byte positions:
; Index 3 (second 'l') = Position 5 (after "H€l")
; Index 4 ('ö') = Position 6 (after "H€ll")
stdcall TextIndexToPos, [pText], 3 ; Returns position 5
stdcall TextPosToIndex, [pText], 5 ; Returns index 3
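For readers who want to see what such a conversion involves, here is a rough Python model that works on a flat byte string and ignores the gap. The function names are hypothetical and zero-based indices and positions are assumed; the real functions operate on a TText structure.

```python
def index_to_pos(data, index):
    """Byte position of the character with the given zero-based index."""
    pos = 0
    for _ in range(index):
        if pos >= len(data):
            break
        pos += 1
        # Skip continuation bytes (bit pattern 10xxxxxx) of the same character.
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
    return pos


def pos_to_index(data, pos):
    """Number of characters that start before the given byte position."""
    return sum(1 for b in data[:pos] if (b & 0xC0) != 0x80)


data = "añ€z".encode("utf-8")   # 'a'=1, 'ñ'=2, '€'=3, 'z'=1 byte
print(len(data))                # 7
print(index_to_pos(data, 2))    # 3: '€' starts after 'a' and 'ñ'
print(index_to_pos(data, 3))    # 6: 'z' starts after the 3-byte '€'
print(pos_to_index(data, 6))    # 3
```

Both directions rely on the same property: every byte of the form 10xxxxxx belongs to the character that started earlier, so only the other bytes are counted as characters.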
7. Coordinate System Conversion Depends on Gap Position
Discovery: TextPosToOffset and TextOffsetToPos results vary based on where the gap is located. Offset includes the gap, position excludes it.
Example:
Text: "Hello World" (11 bytes)
Gap at position 6 (after "Hello ")
Position 6 = Offset 6 (before gap)
Position 7 = Offset 19 (after gap: 7 + 12, since the gap is 12 bytes)
Why This Matters: When working with raw memory addresses (offsets), you must account for the gap. Use positions for logical text operations.
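The relationship between positions and offsets can be modelled in a few lines of Python. This is a sketch with hypothetical names (GapBuffer, gap_begin, gap_end), not the TText layout. How a position exactly at the gap start is reported is library-specific; the sketch maps it to the first byte after the gap, which is where that byte is stored.

```python
class GapBuffer:
    """Toy gap buffer: text before the gap, a gap, then text after it."""

    def __init__(self, before, after, gap_size):
        self.gap_begin = len(before)
        self.gap_end = self.gap_begin + gap_size
        self.storage = bytearray(before) + bytearray(gap_size) + bytearray(after)

    def pos_to_offset(self, pos):
        # Positions before the gap map 1:1; later ones are shifted by the gap size.
        if pos < self.gap_begin:
            return pos
        return pos + (self.gap_end - self.gap_begin)

    def offset_to_pos(self, offset):
        if offset < self.gap_begin:
            return offset
        if offset < self.gap_end:
            return self.gap_begin  # inside the gap: clamp (an assumption)
        return offset - (self.gap_end - self.gap_begin)

    def byte_at(self, pos):
        return self.storage[self.pos_to_offset(pos)]


buf = GapBuffer(b"Hello ", b"World", gap_size=12)
print(buf.pos_to_offset(5))    # 5: still before the gap
print(buf.pos_to_offset(7))    # 19: 7 + 12-byte gap
print(chr(buf.byte_at(7)))     # o
print(buf.offset_to_pos(19))   # 7
```

Moving the gap changes gap_begin and gap_end, so the same position can map to a different offset after an edit, while the position itself stays stable.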
8. TextGetChar Returns Character in AL Register
Discovery: TextGetChar returns the UTF-8 first byte in AL, not a full handle or pointer. For multi-byte characters, you need to read additional bytes.
Pattern:
stdcall TextGetChar, [pText] ; Returns first byte in AL
cmp al, $80 ; Check if multi-byte
jb .ascii_char ; Single byte if < $80
; ... handle multi-byte UTF-8 ...
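The multi-byte handling that follows the $80 check is standard UTF-8 decoding: the lead byte says how many bytes the character occupies. A minimal Python sketch with hypothetical function names, assuming well-formed input:

```python
def utf8_sequence_length(lead):
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if (lead & 0xE0) == 0xC0:
        return 2                      # 110xxxxx
    if (lead & 0xF0) == 0xE0:
        return 3                      # 1110xxxx
    if (lead & 0xF8) == 0xF0:
        return 4                      # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead)


def decode_at(data, pos):
    """Return (code point, byte length) of the character starting at pos."""
    lead = data[pos]
    n = utf8_sequence_length(lead)
    if n == 1:
        return lead, 1
    cp = lead & (0xFF >> (n + 1))     # keep only the payload bits of the lead byte
    for b in data[pos + 1:pos + n]:
        cp = (cp << 6) | (b & 0x3F)   # each continuation byte adds 6 bits
    return cp, n


data = "H€".encode("utf-8")
print(decode_at(data, 0))   # (72, 1)
print(decode_at(data, 1))   # (8364, 3), i.e. U+20AC
```

A continuation byte ($80 to $BF) is rejected as a lead byte here, which is a useful signal that the read position was not on a character boundary.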
Summary of Best Practices
Swap quote types when needed in FASM inline strings
Use <'text with "quotes"'> when you need double quotes inside
Use "text with 'quotes'" when you need single quotes inside
Only define constants in iglobal for complex/reusable strings
Zero-extend 8-bit values before stdcall
Use movzx eax, al before calling functions with byte values
Remember stdcall pushes arguments on stack
Free memory from TextPrepareSearch
Track the returned index table pointer
Call FreeMem when done searching
Match search flags consistently
Use same flags in Prepare and Search calls
Case sensitivity must match
Use position for logical operations
Position excludes gap - what you usually want
Offset includes gap - for raw memory access only
Index for UTF-8 character counting
Handle UTF-8 multi-byte characters correctly
Check if byte >= $80 for multi-byte
Use TextIndexToPos/TextPosToIndex for conversion
Next Steps
After completing this tutorial, continue to:
Tutorial 06: TText - Real-World Patterns