Counting the characters in a Unicode string

by Richard Russell, March 2010

BBC BASIC for Windows provides native support for the Unicode Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main Help documentation describes how to enable Unicode support.

The Unicode encoding used by BBC BASIC for Windows is UTF-8. This is used in preference to other encodings (for example UTF-16) for the following reasons:

  • UTF-8 is represented as a byte stream, which is compatible with BBC BASIC's string variables, functions and operators.
  • Regular 7-bit ASCII text is represented identically in UTF-8 and ANSI, making it extremely easy to work with such text.
  • UTF-8 is compatible with BBC BASIC's VDU codes; you can mix UTF-8 text and VDU sequences in the same string and PRINT them.
  • You can embed UTF-8 text within a program as string constants or DATA statements (although they will not display as expected in the program editor); UTF-16 cannot be used in this way.
  • UTF-8 has only one version, whereas UTF-16 is byte-order dependent (it has little-endian and big-endian versions).
  • UTF-8 is the preferred Unicode encoding for emails and web pages.

UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding. That means you cannot determine the number of characters in a string using the LEN function (it returns the length in bytes, not in characters). Similarly, the COUNT function and features that depend on it (i.e. the WIDTH statement and the TAB(x) function) won't necessarily work as expected. Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a proportionally spaced font is in use.
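
As a minimal illustration of the mismatch between bytes and characters, consider the word "café". The string is an assumption chosen for this example, and it is built from explicit byte values so that it does not depend on how the program editor stores the source. In UTF-8 the "é" (U+00E9) is encoded as the two bytes &C3 &A9, so the four-character word occupies five bytes:

        REM "cafe" with an acute accent on the e, as UTF-8 bytes
        U$ = "caf" + CHR$(&C3) + CHR$(&A9)
        REM LEN counts bytes, so this prints 5 although there are only 4 characters
        PRINT LEN(U$)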

To overcome this disadvantage the function FNulen is listed below. This takes as a parameter a Unicode (UTF-8) string, and returns the length of the string in characters:

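        DEF FNulen(U$)
        LOCAL L%
        REM CP_UTF8 is a global variable, so it is also available to PROCuextent
        CP_UTF8 = 65001
        REM With no output buffer (0, 0) the API returns the length in wide characters
        SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), 0, 0 TO L%
        = L%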

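If passed a string containing only 7-bit ASCII text, the function will return the same value as LEN(U$). Strictly, the value returned is the number of 16-bit UTF-16 units, which equals the number of characters for any text within the Basic Multilingual Plane.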
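
Where calling the Windows API is not desirable, the same count can be sketched in plain BASIC by relying on a property of UTF-8: every byte of a character after the first is a continuation byte of the form 10xxxxxx, so counting the bytes that are not continuation bytes counts the characters. FNulen2 is a hypothetical name introduced for this example, and the sketch assumes the string is well-formed UTF-8 within the Basic Multilingual Plane; for malformed input it may not agree with FNulen.

        DEF FNulen2(U$)
        LOCAL I%, N%
        REM Guard the empty string, because a FOR loop always runs at least once
        IF U$ = "" THEN = 0
        FOR I% = 1 TO LEN(U$)
          REM Count every byte that is not a continuation byte (10xxxxxx)
          IF (ASC(MID$(U$, I%)) AND &C0) <> &80 THEN N% += 1
        NEXT
        = N%

For example, FNulen2("caf" + CHR$(&C3) + CHR$(&A9)) returns 4, because the final byte &A9 is a continuation byte and is not counted.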

If you need to know the extent (that is, the physical width and height) of a Unicode (UTF-8) string, for example in order to centre it on the screen or on a printout, you can use the following procedure:

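        DEF PROCuextent(hdc%, U$, size{})
        LOCAL L%, U%
        REM FNulen also sets the global CP_UTF8 used below
        L% = FNulen(U$)
        REM Reserve a buffer for L% 16-bit characters and align it to an even address
        DIM U% LOCAL 2*L%
        U% = (U% + 1) AND -2
        REM Convert to UTF-16, then measure using the font selected in hdc%
        SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), U%, L%
        SYS "GetTextExtentPoint32W", hdc%, U%, L%, size{}
        ENDPROC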

If passed a string containing only 7-bit ASCII text, the procedure will return the same size as the ANSI function GetTextExtentPoint32 would for that string.
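
A usage sketch follows. It assumes the size structure is declared with two integer members named cx% and cy% (mirroring the Windows SIZE structure), and that the text is destined for the output window, so @memhdc% is passed as the device context; for a printout the printer's device context would be passed instead. It also assumes that Unicode support has already been enabled and the intended font selected, as described in the main Help documentation. The test string is the same illustrative "café" built from explicit UTF-8 bytes.

        REM Structure to receive the width and height (assumed member names)
        DIM size{cx%, cy%}
        U$ = "caf" + CHR$(&C3) + CHR$(&A9)
        PROCuextent(@memhdc%, U$, size{})
        REM The extent is in logical units (pixels in the default mapping mode)
        PRINT "Width:  "; size.cx%
        PRINT "Height: "; size.cy%

To centre the string horizontally, the starting position would be half the difference between the width of the output area and size.cx%.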

Last modified: 2018/04/16 14:46 by richardrussell