Unicode and You
Marc Hoffman (CEO RemObjects) 30/Oct/2019
In Elements, the native String type on all target platforms uses UTF-16 encoding. UTF-16 is a great middle ground, because 16 bits are enough to represent the most common Unicode code points, including not just Latin letters, symbols and accents, but also most other commonly used character sets such as Greek, Cyrillic, and most Asian scripts.
But it is important to keep in mind that Unicode code points can take up to 21 bits (the range runs to 10FFFF), so there is still a wide range of characters that cannot be expressed in a single 16-bit char. Where UTF-8 uses multi-byte sequences for larger values, UTF-16 uses Surrogate Pairs to encode them.
Surrogate Pairs
What this means is that two ranges of 16-bit values, D800-DBFF and DC00-DFFF, are reserved, and whenever a Unicode code point does not fit into 16 bits, it is encoded as two 16-bit values: a high surrogate from the first range followed by a low surrogate from the second.
Consider this string: "Fun 😃 Times!". If we look at the individual 16-bit Chars, we see this:
Code:
46,75,6E,20,D83D,DE03,20,54,69,6D,65,73,21
Note the two values D83D and DE03. These are a surrogate pair, and they decode to the Unicode code point 1F603, which is of course too large to fit into 16 bits.
So while this string reports a length() of 13, it really only contains 12 Unicode code points, which matches what we see visually.
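The arithmetic behind the pair can be sketched in a few lines. This is an illustration in Python, not Elements code; the function names `utf16_units`, `encode_surrogate_pair` and `decode_surrogate_pair` are made up for this example. Python strings count code points rather than 16-bit chars, so the sketch encodes to UTF-16 explicitly to get the same view as above.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def encode_surrogate_pair(code_point):
    """Split a code point above FFFF into (high surrogate, low surrogate)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # upper 10 bits go into D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits go into DC00-DFFF
    return high, low


def decode_surrogate_pair(high, low):
    """Merge a high and a low surrogate back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


text = "Fun \U0001F603 Times!"             # the example string, emoji as an escape
units = utf16_units(text)

print(",".join(f"{unit:X}" for unit in units))
print(len(units), "UTF-16 chars,", len(text), "code points")
```

The string is 13 chars long in UTF-16 but only 12 code points long, which is exactly the mismatch described above.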
Luckily, Elements RTL adds a few really helpful APIs on its String type that help you deal with this.
First, there's ToUnicodeCodePoints, which returns an array of merged, 32-bit Unicode code points. In this case, predictably, it will return this:
Code:
46,75,6E,20,1F603,20,54,69,6D,65,73,21
Note how all the "regular" characters are untouched, but the surrogate pair has been merged. There are a couple more handy helper functions.
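To make the merging step concrete, here is a minimal Python sketch of what such a conversion might look like. It is not the Elements RTL implementation; `to_code_points` is a hypothetical name, and keeping unpaired surrogates unchanged is an assumption made for this example.

```python
def to_code_points(units):
    """Merge surrogate pairs in a list of UTF-16 code units into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by a low surrogate: merge the two.
            merged = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            code_points.append(merged)
            i += 2
        else:
            # A regular 16-bit value (or an unpaired surrogate, kept as-is here).
            code_points.append(unit)
            i += 1
    return code_points


units = [0x46, 0x75, 0x6E, 0x20, 0xD83D, 0xDE03, 0x20, 0x54, 0x69, 0x6D, 0x65, 0x73, 0x21]
print(",".join(f"{cp:X}" for cp in to_code_points(units)))
```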
IsIndexInsideOfASurrogatePair returns whether an index into the string is in the middle of a surrogate pair. For the above example, it would return true for 5 only. This method can be useful for string manipulations. For example, imagine you were about to insert a character, or split the string, at index 5. That would be a bad idea, as it would break the surrogate pair and result in an invalid UTF-16 string.
I use this, for example, in the Fire code editor. Say the cursor is left of the emoji, and you press Right. It used to be that you'd end up in the middle of the emoji, and could type a space to break it apart. Not anymore: if IsIndexInsideOfASurrogatePair is true, the cursor moves by two indices, so that you're now to the right of the emoji (the same goes for pressing Delete, etc.).
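The cursor logic described here could look roughly like the following Python sketch. It is only a guess at the shape of such a check, not the code used in Fire or Elements RTL; `is_index_inside_surrogate_pair` and `move_right` are hypothetical names, and the string is modelled as a list of 16-bit values.

```python
def is_index_inside_surrogate_pair(units, index):
    """True if index points at a low surrogate that directly follows a high one."""
    if index <= 0 or index >= len(units):
        return False
    previous_is_high = 0xD800 <= units[index - 1] <= 0xDBFF
    current_is_low = 0xDC00 <= units[index] <= 0xDFFF
    return previous_is_high and current_is_low


def move_right(units, cursor):
    """Advance the cursor by one position, skipping over a whole surrogate pair."""
    cursor = min(cursor + 1, len(units))
    if is_index_inside_surrogate_pair(units, cursor):
        cursor += 1  # never stop between the two halves of a pair
    return cursor


# UTF-16 values of the "Fun ... Times!" example from above.
units = [0x46, 0x75, 0x6E, 0x20, 0xD83D, 0xDE03, 0x20, 0x54, 0x69, 0x6D, 0x65, 0x73, 0x21]

inside = [i for i in range(len(units) + 1) if is_index_inside_surrogate_pair(units, i)]
print(inside)
print(move_right(units, 4))
```

With the cursor at index 4 (just left of the emoji), one press of Right lands at index 6 instead of 5.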
With that, you might think you're all set and equipped to deal with Unicode — but that's really only half of the story. Enter Joined Characters.
Joined Characters
Consider this string: "WTF🤷🏼‍♀️?".
Based on the above, you'd probably know what to expect: four regular characters, and maybe a surrogate pair for the emoji, right? Wrong. Let's look at the character values:
Code:
57,54,46,D83E,DD37,D83C,DFFC,200D,2640,FE0F,3F
WTF, indeed, right? This time, we have not two but seven UTF-16 chars occupied by the single emoji. What's going on here? First, let's unpack the surrogate pairs by calling ToUnicodeCodePoints. That, predictably, gives us:
Code:
57,54,46,1F937,1F3FC,200D,2640,FE0F,3F
Of the seven chars, two surrogate pairs got merged, and three remained as they are. So why are there still five code points for a single emoji?
It turns out that in Unicode, a single code point (note how I have avoided saying "character" until now) does not necessarily represent a distinct character. Certain code points can combine to form a single character. This is used in a number of cases (for example, combining accents with a regular letter), but it is very common in emoji, in this example to affect sex and skin tone. What looks like a "medium-light-skinned woman shrugging" is actually the code point for "person shrugging", with a modifier for skin tone and a modifier for sex:
- 1F937 (Person shrugging)
- 1F3FC (Skin Tone Modifier, medium-light)
- 200D (Zero Width Joiner)
- 2640 (Female Sign)
- FE0F (Variation Selector-16, an invisible code point which specifies that the preceding character should be displayed with emoji presentation)
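If you want to inspect such a sequence yourself, a quick way is to print the formal Unicode name of each code point. The snippet below is a small Python sketch using the standard `unicodedata` module; `describe` is a helper name invented for this example. Note that the formal names reported by the Unicode database can differ from the friendlier labels used in the list above.

```python
import unicodedata

# The shrugging emoji from the example, written as explicit code point escapes.
emoji = "\U0001F937\U0001F3FC\u200D\u2640\uFE0F"


def describe(text):
    """Return (hex value, formal Unicode name) for every code point in text."""
    return [(f"{ord(ch):X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for value, name in describe(emoji):
    print(value, name)
```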
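Elements RTL can help here as well: ToUnicodeCharacters splits a string into its individual characters. For the string above, it returns:
Code:
W,T,F,🤷🏼‍♀️,?
Since Unicode characters can't be expressed by a single hex value, this method returns a list of strings, each one containing all the code points that make up an individual character.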
There is also IsIndexInsideOfAJoinedUnicodeCharacter which, again, lets you know if a given string index falls within a joined character. And because joined characters, unlike surrogate pairs, don't have a well-known length of just two, we have StartIndexOfJoinedUnicodeCharacterAtIndex and IndexAfterJoinedUnicodeCharacterCoveringIndex, which allow you to find the beginning or the end of a character (including, of course, accounting for surrogate pairs).
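To give a feel for what such helpers have to do, here is a deliberately simplified Python sketch. It is not the Elements RTL algorithm and it does not implement the full Unicode segmentation rules: it only glues on combining marks, variation selectors, skin tone modifiers, anything following a Zero Width Joiner, and pairs of regional indicators. All function names (`to_characters`, `character_spans`, `span_covering`, `is_index_inside_joined_character`) are invented for this example, and indices are counted in UTF-16 chars to match the article.

```python
import unicodedata

ZWJ = "\u200D"  # Zero Width Joiner


def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_extender(ch):
    """Code points that attach to the character before them (simplified)."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me")  # combining marks such as accents
        or 0xFE00 <= cp <= 0xFE0F                 # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF               # emoji skin tone modifiers
    )


def to_characters(text):
    """Split text into user-visible characters (a rough approximation)."""
    characters = []
    i = 0
    while i < len(text):
        start = i
        i += 1
        # Two regional indicators in a row form a single flag.
        if is_regional_indicator(text[start]) and i < len(text) and is_regional_indicator(text[i]):
            i += 1
        while i < len(text):
            if is_extender(text[i]):
                i += 1
            elif text[i] == ZWJ and i + 1 < len(text):
                i += 2  # the joiner glues the following code point on as well
            else:
                break
        characters.append(text[start:i])
    return characters


def utf16_length(text):
    return len(text.encode("utf-16-le")) // 2


def character_spans(text):
    """Return one (start, end) range of UTF-16 indices per character."""
    spans, position = [], 0
    for character in to_characters(text):
        end = position + utf16_length(character)
        spans.append((position, end))
        position = end
    return spans


def span_covering(text, index):
    """Return (start index, index after) of the character covering index."""
    for start, end in character_spans(text):
        if start <= index < end:
            return start, end
    raise IndexError(index)


def is_index_inside_joined_character(text, index):
    return any(start < index < end for start, end in character_spans(text))


wtf = "WTF\U0001F937\U0001F3FC\u200D\u2640\uFE0F?"
print(character_spans(wtf))
print(span_covering(wtf, 6))
```

For the example string, the emoji covers UTF-16 indices 3 through 9, so any index from 4 to 9 is "inside" it, and the safe positions around it are 3 and 10.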
Another common example of joined characters is the flag emoji. Consider "Bon bini na 🇨🇼", which after merging surrogate pairs expands to:
Code:
42,6F,6E,20,62,69,6E,69,20,6E,61,20,1F1E8,1F1FC
Unicode actually reserves 26 code points, 1F1E6-1F1FF, as "regional indicators". Each of the 26 code points represents a letter A through Z, and any flag can be represented by combining the two letters of the country code, CW (Curaçao) in this case.
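The letter-to-indicator mapping is a plain offset, which a short Python sketch can show. The names `flag_for` and `country_code_for` are made up for this example; whether a given pair actually renders as a flag depends on the font and platform.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # regional indicator for the letter A


def flag_for(country_code):
    """Turn a two-letter country code into its pair of regional indicators."""
    if len(country_code) != 2 or not country_code.isascii() or not country_code.isalpha():
        raise ValueError("expected a two-letter country code")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


def country_code_for(flag):
    """Turn a pair of regional indicators back into the two-letter code."""
    return "".join(chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A")) for ch in flag)


flag = flag_for("CW")
print(",".join(f"{ord(ch):X}" for ch in flag))
print(country_code_for(flag))
```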
ToUnicodeCharacters of course handles this fine:
Code:
B,o,n, ,b,i,n,i, ,n,a, ,🇨🇼
Unicode: It's not as Simple as it Seems
So that's a peek behind UTF-16 and Unicode, and some of the APIs that Elements RTL provides to help you work with Unicode data more safely.
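If you want to look at what's going on behind the scenes, I recommend checking out the Elements RTL source code, along with the accompanying material linked from the original article, which explores a lot of corner cases.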