UTF-8 String functions.

Discussions related to the code libraries supplied with BB4W & BBCSDL

Post by Zaphod » Wed 18 Jul 2018, 23:45

BBC BASIC string keywords such as MID$() assume one byte per character (ANSI), so they give strange or misleading results when used on UTF-8 strings.
Below are some alternatives that work correctly on UTF-8 strings, and on ANSI strings that contain only 7-bit ASCII characters.
Here are replacements for INSTR(), MID$(), LEFT$(), RIGHT$() and LEN(), plus a function that finds the number of bytes occupied by a given number of characters.
These should work on all current BBC BASIC versions. They are slower than the original keywords, of course, so you would not normally use them unless UTF-8 strings were involved.
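
As a small illustration of the problem, the sketch below builds a UTF-8 string byte by byte with CHR$ (so it does not depend on the editor's encoding) and shows the built-in keywords counting bytes rather than characters. The variable names and the sample word are only illustrative.

Code:

      REM "é" encoded as UTF-8 is the two bytes &C3 &A9
      a$ = "caf" + CHR$(&C3) + CHR$(&A9)
      PRINT LEN(a$)               : REM 5 (bytes), although there are only 4 characters
      b$ = LEFT$(a$, 4)           : REM cuts the string between &C3 and &A9
      PRINT ASC(RIGHT$(b$, 1))    : REM 195 (&C3), a lead byte left without its continuation byte
      END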

Code:

REM Library functions (corrected to have explicit integer parameters)

      DEF FN_uinstr(a$,b$,st%)   :REM st% (start character number) is mandatory. Set to 0 or 1 to search from start.
      LOCAL S%, T%
      S%=FN_ucount(a$,0,st%-1)+1 :REM start position converted to bytes
      T%=INSTR(a$,b$,S%)        :REM get match result in bytes
      IF T% a$=LEFT$(a$,T%): =FN_ulen(a$)  :REM slice off in bytes and measure its length in chars
      =0

      DEF FN_umid(a$,st%,num%)
      REM UTF-8 MID$ replacement
      LOCAL S%, N%
      S%=FN_ucount(a$,0,st%-1)+1
      N%=FN_ucount(a$,S%-1,num%)-S%+1
      =MID$(a$,S%,N%)

      DEF FN_uleft(a$,num%)
      REM UTF-8 LEFT$ replacement
      =LEFT$(a$,FN_ucount(a$,0,num%))

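      DEF FN_uright(a$,num%):REM Scans backwards from the end, so it is fastest for small selections.
      LOCAL A%, L%, I%, J%
      IF num%<=0 OR a$="" THEN ="" :REM Nothing to return; also avoids scanning an empty string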
      A%=!^a$
      FOR I% =LENa$-1 TO 0 STEP -1
        J%=A%?I%
        IF (J% AND &C0) <> &80 : L%+=1 :REM Anything not a continuation is a start of char.
        IF L%=num% EXIT FOR
      NEXT
      =RIGHT$(a$,LENa$-I%)

      DEF FN_ulen(a$)
      REM Finds Length in characters of UTF-8 string.
      LOCAL A%, L%, I%, J%
      A%=!^a$
      WHILE I%<LENa$
        J%=A%?I%
        CASE TRUE OF
            REM Character start bytes
          WHEN (J% AND &E0) = &C0 : L%+=1 : I%+=2
          WHEN (J% AND &80) = 0   : L%+=1 : I%+=1
          WHEN (J% AND &F0) = &E0 : L%+=1 : I%+=3
          WHEN (J% AND &F8) = &F0 : L%+=1 : I%+=4
          WHEN (J% AND &C0) = &80 : I%+=1 : REM Continuation byte. Should never execute!
          OTHERWISE I%+=1 : REM Invalid byte (&F8 to &FF): skip it so the loop cannot hang
        ENDCASE
      ENDWHILE
      =L%

      DEF FN_ucount(a$, I%, nchars%)
      REM I% is the starting offset in bytes (0-based). Returns the byte offset reached after advancing nchars% characters from there.
      LOCAL A%, L%, J%
      A%=!^a$                       :REM Address of start of string
      WHILE L%<=nchars%-1 AND I%<LENa$
        J%=A%?I%                    :REM Get next byte
        CASE TRUE OF
            REM Compare byte with UTF-8 start-byte patterns, most likely cases first.
          WHEN (J% AND &E0) = &C0 : L%+=1 : I%+=2
          WHEN (J% AND &80) = 0     : L%+=1 : I%+=1
          WHEN (J% AND &F0) = &E0 : L%+=1 : I%+=3
          WHEN (J% AND &F8) = &F0 : L%+=1 : I%+=4
          WHEN (J% AND &C0) = &80 : I%+=1 : REM Continuation byte. Should never execute!
          OTHERWISE I%+=1 : REM Invalid byte (&F8 to &FF): skip it so the loop cannot hang
        ENDCASE
      ENDWHILE
      =I% :REM bytes used for nchars
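
A minimal usage sketch for the functions above, assuming they are appended to the same program (or loaded as a library). The test string is assembled with CHR$ so that the example does not depend on the editor's encoding; the variable name and the sample word are purely illustrative, and the values in the comments come from tracing the code as posted rather than from a documented specification.

Code:

      REM "naïve" in UTF-8: the "ï" is the two bytes &C3 &AF
      a$ = "na" + CHR$(&C3) + CHR$(&AF) + "ve"
      PRINT LEN(a$)                 : REM 6 bytes
      PRINT FN_ulen(a$)             : REM 5 characters
      PRINT LEN(FN_umid(a$,3,1))    : REM 2, the two bytes of the single character "ï"
      PRINT LEN(FN_uleft(a$,3))     : REM 4, the bytes of "naï"
      PRINT LEN(FN_uright(a$,3))    : REM 4, the bytes of "ïve"
      PRINT FN_uinstr(a$,"ve",0)    : REM 4, "ve" starts at the fourth character
      END

The results are checked with LEN rather than printed directly because displaying UTF-8 text correctly depends on how the output window is configured.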
      
Z
Last edited by Zaphod on Fri 20 Jul 2018, 18:23, edited 1 time in total.

Post by hellomike » Fri 20 Jul 2018, 15:04

Could be handy indeed!

Just out of curiosity, why do you use floats for some parameters instead of integers?

Mike

Post by Zaphod » Fri 20 Jul 2018, 18:15

That is an interesting question, Mike.

It is just personal preference to sometimes use variants for integer parameters. On another day I might have used num%, nchars% and st% for the parameters and maybe it would have been better if I had. I had not considered that someone would put in non-integer values so that it actually was a float and had not tested whether that would make it malfunction. In fact it does!!!

So I am going to go back and modify that earlier posting. It is nchars that is the problem but I will change them all.
Thanks, Mike, for causing me to look at this again. It is definitely a bug and I am glad that you looked at the code.
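
To illustrate the kind of thing that can go wrong, here is a hypothetical sketch. The two functions below are invented for the illustration and are not the code that was originally posted. Each counts loop iterations up to a limit; the only difference is whether the limit is a variant or an integer parameter. An integer parameter converts a non-integer argument to a whole number on entry, whereas a variant keeps the fraction, so the loop can run one extra time.

Code:

      PRINT FN_steps_variant(2.2)   : REM 3, because the limit stays at 2.2
      PRINT FN_steps_integer(2.2)   : REM 2, because the limit becomes 2 on entry
      END

      DEF FN_steps_variant(n)
      LOCAL L%
      WHILE L% < n
        L% += 1
      ENDWHILE
      =L%

      DEF FN_steps_integer(n%)
      LOCAL L%
      WHILE L% < n%
        L% += 1
      ENDWHILE
      =L%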

Z
