UTF-8 String functions.

Post by Zaphod » Wed 18 Jul 2018, 23:45

BBC BASIC string keywords such as MID$() assume single-byte (ANSI) characters and give strange or misleading results when used on UTF-8 strings, because they count bytes rather than characters.
Below are some alternatives that work correctly on UTF-8 or ANSI.
Here are replacements for INSTR(), MID$(), LEFT$(), RIGHT$() and LEN(), plus FN_ucount(), which finds the number of bytes occupied by a given number of characters.
These should work on all current BBC BASIC versions. They are slower than the original keywords, of course, so you would not normally use them unless UTF-8 strings were involved.
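
To make the problem concrete, here is a minimal sketch (not part of the library). It assumes the string is built from raw bytes with CHR$ so that the result does not depend on the editor's encoding; the variable names are only illustrative.

```bbcbasic
REM Illustrative only: "cafe" with an acute accent, built from raw UTF-8 bytes
a$ = "caf" + CHR$(&C3) + CHR$(&A9)  : REM the accented e is the two bytes &C3 &A9
PRINT LEN(a$)                       : REM 5 (bytes), although there are 4 characters
PRINT LEN(LEFT$(a$, 4))             : REM 4 bytes: "caf" plus only the first byte of the accented e

REM Counting characters: every byte that is not a continuation byte
REM (binary 10xxxxxx) starts a new character
n% = 0
FOR i% = 1 TO LEN(a$)
  IF (ASC(MID$(a$, i%, 1)) AND &C0) <> &80 THEN n% += 1
NEXT
PRINT n%                            : REM 4
```

LEFT$(a$, 4) here ends in the middle of the accented character, which is the kind of result the functions below are meant to avoid.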

Code: Select all

REM Library functions (corrected to have explicit integer parameters)

      DEF FN_uinstr(a$,b$,st%)   :REM st% is start character number not optional. Set to 0 or 1 to search from start.
      LOCAL S%, T%
      S%=FN_ucount(a$,0,st%-1)+1 :REM start position converted to bytes
      T%=INSTR(a$,b$,S%)        :REM get match result in bytes
      IF T% THEN a$=LEFT$(a$,T%): =FN_ulen(a$)  :REM slice off in bytes and measure its length in chars
      =0

      DEF FN_umid(a$,st%,num%)
      REM UTF-8 MID$ replacement
      LOCAL S%, N%
      S%=FN_ucount(a$,0,st%-1)+1
      N%=FN_ucount(a$,S%-1,num%)-S%+1
      =MID$(a$,S%,N%)

      DEF FN_uleft(a$,num%)
      REM UTF-8 LEFT$ replacement
      =LEFT$(a$,FN_ucount(a$,0,num%))

      DEF FN_uright(a$,num%):REM Search from right fastest with small selections.
      LOCAL A%, L%, I%, J%
      A%=!^a$
      FOR I% =LENa$-1 TO 0 STEP -1
        J%=A%?I%
        IF (J% AND &C0) <> &80 : L%+=1 :REM Anything not a continuation is a start of char.
        IF L%=num% EXIT FOR
      NEXT
      =RIGHT$(a$,LENa$-I%)

      DEF FN_ulen(a$)
      REM Finds Length in characters of UTF-8 string.
      LOCAL A%, L%, I%, J%
      A%=!^a$
      WHILE I%<LENa$
        J%=A%?I%
        CASE TRUE OF
            REM Character start bytes
          WHEN (J% AND &E0) = &C0 : L%+=1 : I%+=2
          WHEN (J% AND &80) = 0   : L%+=1 : I%+=1
          WHEN (J% AND &F0) = &E0 : L%+=1 : I%+=3
          WHEN (J% AND &F8) = &F0 : L%+=1 : I%+=4
          WHEN (J% AND &C0) = &80 : I%+=1 : REM Continuation byte. Should never execute!
          OTHERWISE I%+=1 : REM Invalid byte (&F8-&FF): skip it so the loop cannot hang
        ENDCASE
      ENDWHILE
      =L%

      DEF FN_ucount(a$, I%, nchars%)
      REM I% start of count in bytes. Returns total count in bytes adding nchars to start posn.
      LOCAL A%, L%, J%
      A%=!^a$                       :REM Address of start of string
      WHILE L%<=nchars%-1 AND I%<LENa$
        J%=A%?I%                    :REM Get next byte
        CASE TRUE OF
            REM Compare byte with UTF-8 start bytes. Order is most likely found.
          WHEN (J% AND &E0) = &C0 : L%+=1 : I%+=2
          WHEN (J% AND &80) = 0     : L%+=1 : I%+=1
          WHEN (J% AND &F0) = &E0 : L%+=1 : I%+=3
          WHEN (J% AND &F8) = &F0 : L%+=1 : I%+=4
          WHEN (J% AND &C0) = &80 : I%+=1 : REM Continuation byte. Should never execute!
          OTHERWISE I%+=1 : REM Invalid byte (&F8-&FF): skip it so the loop cannot hang
        ENDCASE
      ENDWHILE
      =I% :REM bytes used for nchars
      
Z
Last edited by Zaphod on Fri 20 Jul 2018, 18:23, edited 1 time in total.

Post by hellomike » Fri 20 Jul 2018, 15:04

Could be handy indeed!

Just out of curiosity, why do you use floats for some parameters instead of integers?

Mike

Post by Zaphod » Fri 20 Jul 2018, 18:15

That is an interesting question, Mike.

It is just personal preference to sometimes use variants for integer parameters. On another day I might have used num%, nchars% and st% for the parameters, and maybe it would have been better if I had. I had not considered that someone might pass in a non-integer value, so that the parameter really was a float, and I had not tested whether that would make the functions malfunction. In fact it does!

So I am going to go back and modify that earlier posting. It is nchars that is the problem, but I will change them all.
Thanks, Mike, for causing me to look at this again. It is definitely a bug and I am glad that you looked at the code.
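
For anyone curious how a fractional count can upset this kind of code, here is a small hypothetical sketch. FNloops_v and FNloops_i are invented names, and this is not necessarily the exact failure in the original listing; it only shows that an equality test against a non-integer never succeeds, and that a parameter with a % suffix truncates the value on entry.

```bbcbasic
REM Hypothetical demo: FNloops_v takes a variant, FNloops_i an integer
PRINT FNloops_v(2)    : REM returns 2: the loop exits when L% reaches 2
PRINT FNloops_v(2.5)  : REM returns 10: L% never equals 2.5, so the loop runs to its limit
PRINT FNloops_i(2.5)  : REM returns 2: num% receives 2.5 truncated to 2
END

DEF FNloops_v(num)
LOCAL I%, L%
FOR I% = 1 TO 10
  L% += 1
  IF L% = num THEN EXIT FOR
NEXT
= L%

DEF FNloops_i(num%)
LOCAL I%, L%
FOR I% = 1 TO 10
  L% += 1
  IF L% = num% THEN EXIT FOR
NEXT
= L%
```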

Z

Post by Zaphod » Fri 30 Nov 2018, 19:02

I see that new program releases now have an included equivalent library of these UTF-8 string functions. While this makes my versions largely irrelevant, my version of FN_uright() is considerably faster when only a small number of characters is being selected. This is because the search starts from the right, not from the left as in Richard's version. Unless you make masses of calls to that function, speed is probably of no consequence. Before anyone else points it out, my other functions are marginally slower!
But perhaps I can speed up FN_ucount a bit? I wonder whether that new PTR method helps to address the string. And FOR is probably faster than WHILE.
And I actually understand how mine works; well, I did a few weeks back.
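
A rough way to see why the direction matters, as a sketch only: the second loop below is a generic left-to-right scan and is not Richard's library code, and the variable names are made up for the illustration. It counts how many bytes each direction has to look at to find where the last character of a long string starts.

```bbcbasic
REM 1000 ASCII characters followed by one two-byte character (&C3 &A9)
a$ = STRING$(1000, "x") + CHR$(&C3) + CHR$(&A9)

REM From the right: stop at the first byte that is not a continuation byte
nr% = 0
FOR i% = LEN(a$) TO 1 STEP -1
  nr% += 1
  IF (ASC(MID$(a$, i%, 1)) AND &C0) <> &80 THEN EXIT FOR
NEXT

REM From the left: every byte is examined, remembering the latest start byte
nl% = 0 : last% = 0
FOR i% = 1 TO LEN(a$)
  nl% += 1
  IF (ASC(MID$(a$, i%, 1)) AND &C0) <> &80 THEN last% = i%
NEXT

PRINT nr%    : REM 2 bytes examined from the right
PRINT nl%    : REM 1002 bytes examined from the left
PRINT last%  : REM 1001, the byte position where the last character starts
```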

Z

Post by guest » Fri 30 Nov 2018, 19:54

Zaphod wrote:
Fri 30 Nov 2018, 19:02
I see that new program releases now have an included equivalent library of these UTF-8 string functions.
I wasn't able to use your functions because, despite your comment that they "should work on all current BBC BASIC versions", they don't! I think they probably do work in 32-bit versions but not in 64-bit versions (i.e. the iOS and 64-bit Linux editions currently).

Post by Zaphod » Fri 30 Nov 2018, 23:04

guest wrote:
I think they probably do work in 32-bit versions but not in 64-bit versions
I am sure you are right. Those 64-bit versions have 64-bit memory addresses, which I had not allowed for. I'll try to remember in future.

So here is the only one of possible interest, which I hope will also work on 64-bit platforms:

Code: Select all

  
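      DEF FN_uright(a$,num%):REM Search from right fastest with small selections.
      LOCAL a%%, L%, I%, J%
      a%%=PTR(a$)                   :REM 64-bit variable holds the address of the string
      FOR I% =LENa$-1 TO 0 STEP -1
        J%=a%%?I%                   :REM Get byte at offset I%
        IF (J% AND &C0) <> &80 : L%+=1 :REM Anything not a continuation is a start of char.
        IF L%=num% EXIT FOR
      NEXT
      =RIGHT$(a$,LENa$-I%)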
Z
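
A quick way to try it, as a sketch: this assumes a BBC BASIC version that supports PTR(a$) and 64-bit integer variables such as a%%, as the listing above does, and it repeats that listing so that the program is self-contained. The test string is built from raw bytes with CHR$.

```bbcbasic
REM "naive" with a diaeresis on the i, which is the two bytes &C3 &AF in UTF-8
a$ = "na" + CHR$(&C3) + CHR$(&AF) + "ve"
PRINT LEN(RIGHT$(a$, 3))     : REM 3 bytes: the result starts in the middle of the accented i
PRINT LEN(FN_uright(a$, 3))  : REM 4 bytes: three whole characters
END

DEF FN_uright(a$,num%)
LOCAL a%%, L%, I%, J%
a%%=PTR(a$)
FOR I% =LENa$-1 TO 0 STEP -1
  J%=a%%?I%
  IF (J% AND &C0) <> &80 : L%+=1
  IF L%=num% EXIT FOR
NEXT
=RIGHT$(a$,LENa$-I%)
```

The lengths are printed rather than the strings themselves, so the result does not depend on whether the output window is in UTF-8 mode.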
