Multi-byte strings

From WikiPrizm

To support strings with special characters, the OS uses a custom encoding in which some characters take not one byte, but two. Here's an example of such a string:

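 /* "\x02" and "c" are separate literals: written as "\x02c", the hex escape would absorb the 'c' */
 const char* string = "S\xE6\x16me \xE6\x02" "c\xE6\x08\xE6\x0Bnts";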

When printed to the screen by a function that handles multi-byte strings, the previous string would be shown like this:

[Image: the string above as rendered on screen]

Some characters are encoded as they would be on any other platform that supports ASCII. The character set used by the OS is nonetheless non-standard: certain codes do not have their usual ASCII meaning, and codes above 0x7F do not correspond to any known character encoding. For example, if a string contains a line feed character (code 10, usually written in C as "\n"), none of the known text printing syscalls treat it as a line break. They display a graphical representation of the code instead.
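Since the printing syscalls do not break lines on a line feed, a program that wants multi-line output has to split the text itself and print each piece separately. The following is a minimal sketch of such a splitter, not an OS routine. The names `is_lead_byte` and `next_line` are hypothetical, and treating only 0xE6 as a leading byte is an assumption taken from the example string above; the real set of leading bytes is probably larger. The second bytes in that example are all low values (0x02, 0x08, 0x0B, 0x16), so it seems possible for a second byte to equal 0x0A, which is why the sketch skips two-byte characters as a unit rather than searching for raw line feeds.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: only 0xE6 is treated as a leading byte, because it is the
   only one that appears in the example string. Extend as needed. */
static int is_lead_byte(unsigned char b)
{
    return b == 0xE6;
}

/* Hypothetical helper: measures the line that starts at s.
   Stores the line length in bytes (line feed excluded) in *len.
   Returns a pointer to the start of the next line, or NULL if the
   line ended at the string terminator. */
const char *next_line(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    const char *p = s;
    while (*p != '\0') {
        if (is_lead_byte((unsigned char)*p) && p[1] != '\0') {
            p += 2;                     /* two-byte character: never split it */
        } else if (*p == '\n') {
            *len = (size_t)(p - s);
            return p + 1;               /* next line starts after the line feed */
        } else {
            p++;
        }
    }
    *len = (size_t)(p - s);
    return NULL;
}
```

A caller would loop until `next_line` returns NULL, copying each `len`-byte piece into a buffer and passing it to a printing function at an increasing row.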


Related syscalls

There are various syscalls for handling multi-byte strings. Some detect the "leading" byte or the second byte of a character, and there are even special versions of strcpy, strcmp and strcat. These last ones aren't really necessary, as the usual implementations of these functions appear to work just fine with multi-byte strings. So far, the only documented syscall is MB_ElementCount, which returns the number of characters in a string as they would be printed.
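To illustrate the difference between the byte length of a string and the number of characters printed, here is a minimal host-side sketch of a character count. It is not the implementation of MB_ElementCount. The names `is_lead_byte` and `count_elements` are hypothetical, and treating only 0xE6 as a leading byte is an assumption based on the example string at the top of this page.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: only 0xE6 is treated as a leading byte, because it is the
   only one that appears in the example string. Extend as needed. */
static int is_lead_byte(unsigned char b)
{
    return b == 0xE6;
}

/* Hypothetical helper: counts characters as printed, where a leading
   byte and the byte after it together form one character. */
size_t count_elements(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s != '\0') {
        /* A leading byte at the very end of the string counts as one element. */
        s += (is_lead_byte((unsigned char)*s) && s[1] != '\0') ? 2 : 1;
        n++;
    }
    return n;
}
```

Under these assumptions, the example string occupies 16 bytes (as reported by strlen) but counts as 12 characters.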

Characters supported

Most accented Latin characters are defined as multi-byte characters. The whole Greek alphabet also appears to be supported as multi-byte characters, as does the Russian Cyrillic alphabet.

Hanzi/kanji, hiragana and katakana characters are supported by the OS, allowing for Chinese and Japanese strings, but they do not appear to be accessible as multi-byte characters, at least not in the same way as the others. It is possible that some flags need to be set, or that special syscalls must be used, for these characters to be printed. It is also possible that they use a completely different encoding. These characters also use another font (which includes replacements for many Latin characters) that is wider than the PrintXY font, so the "homescreen" fits only 16 columns instead of 21. See Fonts for more information.

Font support

Not all fonts included in the OS support all the characters. In fact, some fonts lack many ASCII codes that other OS fonts support. See Fonts for more information.