counting_20the_20characters_20in_20a_20unicode_20string

Differences

This shows you the differences between two versions of the page.

counting_20the_20characters_20in_20a_20unicode_20string [2018/03/31 13:19]
127.0.0.1 external edit
counting_20the_20characters_20in_20a_20unicode_20string [2018/04/16 14:46] (current)
richardrussell Added syntax highlighting
Line 1: Line 1:
 =====Counting the characters in a Unicode string=====
  
-//by Richard Russell, March 2010//\\ \\ //BBC BASIC for Windows// provides native support for the [[http://en.wikipedia.org/wiki/Unicode|Unicode]] Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin2.html#utf8|Help documentation]] describes how to enable Unicode support.\\ \\  The Unicode encoding used by //BBC BASIC for Windows// is [[http://en.wikipedia.org/wiki/UTF-8|UTF-8]]. This is used in preference to other encodings (for example UTF-16) for the following reasons:\\
+//by Richard Russell, March 2010//\\ \\ //BBC BASIC for Windows// provides native support for the [[http://en.wikipedia.org/wiki/Unicode|Unicode]] Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin2.html#utf8|Help documentation]] describes how to enable Unicode support.\\ \\  The Unicode encoding used by //BBC BASIC for Windows// is [[http://en.wikipedia.org/wiki/UTF-8|UTF-8]]. This is used in preference to other encodings (for example UTF-16) for the following reasons:
  
   * UTF-8 is represented as a //byte stream//, which is compatible with BBC BASIC's string variables, functions and operators.
Line 9: Line 9:
   * UTF-8 has only one version, whereas UTF-16 is byte-order dependent (it has little-endian and big-endian versions).
   * UTF-8 is the preferred Unicode encoding for emails and web pages.
-\\  UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding. That means you cannot determine the number of characters in a string using the **LEN** function (it returns the length in bytes, not in characters). Similarly, the **COUNT** function and features that depend on it (i.e. the **WIDTH** statement and the **TAB(x)** function) won't necessarily work as expected. Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a **proportionally spaced** font is in use.\\ \\  To overcome this disadvantage the function **FNulen** is listed below. This takes as a parameter a Unicode (UTF-8) string, and returns the length of the string in characters:\\
+
+UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding. That means you cannot determine the number of characters in a string using the **LEN** function (it returns the length in bytes, not in characters). Similarly, the **COUNT** function and features that depend on it (i.e. the **WIDTH** statement and the **TAB(x)** function) won't necessarily work as expected. Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a **proportionally spaced** font is in use.\\ \\  To overcome this disadvantage the function **FNulen** is listed below. This takes as a parameter a Unicode (UTF-8) string, and returns the length of the string in characters:
+
+<code bb4w>
         DEF FNulen(U$)
         LOCAL L%
Line 15: Line 18:
         SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), 0, 0 TO L%
         = L%
-If passed a string containing only 7-bit ASCII text, the function will return the same value as **LEN(U$)**.\\ \\  If you need to know the **extent** (that is, the physical width and height) of a Unicode (UTF-8) string, such as you might if you want to centre it on the screen or a printout, you can use the following procedure:\\
+</code>
+
+If passed a string containing only 7-bit ASCII text, the function will return the same value as **LEN(U$)**.\\ \\  If you need to know the **extent** (that is, the physical width and height) of a Unicode (UTF-8) string, such as you might if you want to centre it on the screen or a printout, you can use the following procedure:
+
+<code bb4w>
         DEF PROCuextent(hdc%, U$, size{})
         LOCAL L%, U%
Line 24: Line 31:
         SYS "GetTextExtentPoint32W", hdc%, U%, L%, size{}
         ENDPROC
+</code>
+
 If passed a string containing only 7-bit ASCII text, the procedure will return the same size as **GetTextExtentPoint32**.
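
To make the byte-versus-character distinction concrete, here is a minimal sketch (not part of the original page) of a pure-BASIC counter that needs no API call. The name **FNulen_bytes** is hypothetical. It assumes the string is well-formed UTF-8 and counts every byte that is not a continuation byte (binary 10xxxxxx). For characters in the Basic Multilingual Plane this should agree with **FNulen**, but the two may differ for malformed input or for characters outside the BMP, which UTF-16 represents as two code units.

<code bb4w>
        REM Hypothetical helper, shown for illustration only
        DEF FNulen_bytes(U$)
        LOCAL I%, N%
        REM A FOR loop always runs at least once, so handle the empty string first
        IF U$ = "" THEN = 0
        FOR I% = 1 TO LEN(U$)
          REM Bytes of the form 10xxxxxx continue a character; count all the others
          IF (ASC(MID$(U$, I%, 1)) AND &C0) <> &80 THEN N% += 1
        NEXT I%
        = N%
</code>

For a string containing only 7-bit ASCII text no byte is a continuation byte, so for example **FNulen_bytes("abc")** returns 3, the same as **LEN("abc")**.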
counting_20the_20characters_20in_20a_20unicode_20string.txt · Last modified: 2018/04/16 14:46 by richardrussell