Friday, July 20, 2012


Unicode String Models

Many programming languages (and most modern software) have moved to Unicode model of text. Text coming into the system might be in legacy encodings like Shift-JIS or Latin-1, and text being pushed out the door might also be in legacy encoding, but those are done by explicit conversions at the entry and exit points. Everything else is processed internally in Unicode.

But what model of strings is best for Unicode? I’ll briefly discuss some of the options, and the advantages and disadvantages of each of the main contenders. 


This is still draft, and some sections remain to be fleshed out. Comments and corrections are welcome.

Background

Unicode characters are represented using integer values from 0 to 0x10FFFF, called code points. Not every code point is a character; for example, some are reserved, and get allocated with successive versions of Unicode.

The complication is that there are three different encoding forms for Unicode used to represent these code points. The differences are based on the size of the “code unit”, the basic unit of memory for encoding of that form. These are unsigned ints of width 8, 16, or 32 bits.


Encoding FormBits per Code UnitUnits per Code Point
UTF-8
8
1, 2, 3, or 4
UTF-16
16
1 or 2
UTF-32
32
1

I won’t go into the details of these formats, but the differences are crucial for the string model. The most important issue for APIs is indexing. Let’s look at what the string S = “カβx” looks like in each of the forms in the table below. The # columns indicate the string indexes in terms of code units for each character (the final one being the length).


Chars
#
#
#
β
#
x
#
UTF-8
0
F0 9F 91 BD
4
E3 82 AB
7
CE B2
9
78
10
UTF-16
0
D83D DC7D
2
30AB
3
03B2
4
0078
5
UTF-32
0
0001F47D
1
000030AB
2
000003B2
3
00000078
4

For simple iteration through all the code points in a string, the difference in internals doesn’t matter. For example, such an iteration can be used to see whether the string contains a Greek character.

But the internals matters a great deal for indexing, a common and important operation. We’ll use the notation S[n,m] to mean the substring of S containing code units n ≤ x < m. When indexing in terms of code units:

  • S[0,4] would be カβx”  in UTF-32, カβ” in UTF-16, but ”  in UTF-8
  • S[4,5] would be “x”  in UTF-16, would be an error in UTF-32 (extending beyond the end of the string), and be a different error in UTF-8 (⅓  the way into the character “カ”).
When accessing a native UTF-8 or UTF-16 string, an API could consistently use code point indices, but that costs performance. For example, to determine where code point index 3 is in the UTF-8 string representation of S, that requires scanning from the start of the string. Not a big deal for small strings, but a non-trivial expense for a 10K string, let alone a 10M string.

Frequency and Performance

One relevant point for performance and storage comparisons is that supplementary characters (those requiring 4 bytes in UTF-8/16) are extremely rare as a percentage of all text. They do occur, and are important to support, but have a very low frequency in typical text. That means that it is best to optimize for BMP characters (and as a subset, ASCII and Latin-1), and fall into a ‘slow path’ when a supplementary character is encountered.



Unicode String Models

Based on the indexing scheme, there are four core models for Unicode Strings. The same programming language or library could use or support multiple models, but typically one of them is the default. For example, UTF-16 is the one string model used in Java; C++ supports 3 different string models; ICU supports UTF16, but certain performance-critical APIs like collation also take UTF-8 strings—ICU optimizes some of its internal data structures to work well with both.

Uniform Model

The string class is indexed by code point.
Used by: Python 3.3, Dart, …



Externally, the string looks like it is UTF-32. To conserve space, the internal storage is a uniform array of 8, 16, or 32 bit unsigned integers.
  1. A string whose characters all fit in a byte uses 8-bit units. This covers Latin-1.
  2. A string that has at least one character needing 2 bytes uses 16-bit units. This covers UCS-2 = BMP.
  3. Only a string with a character needing 32 bits needs to expand the array to be 32-bit units.
  4. Values over 0x10FFFF are an error (see Validation)

There are different ways to optimize this model, balancing storage costs against the costs of recopying into a larger (or smaller) unit size. Combined with a Builder or Freeze model, it works particularly well; the mutable builder or unfrozen string uses 32 bit units; only when the immutable String or frozen string is produced is the storage optimized.



Indexing example:



for (int i = 0; i < myString.length(); ++i) {
 codepoint cp = myString[i];
 doSomethingWith(cp);
}

Advantages

  1. Simplest for the programmer
  2. Indexes correspond to Unicode code point boundaries
    1. S[1,3] = “βカ”, no matter what the internal storage
    2. you are never “in the middle” of a character
  3. Random access by code point is fast, constant time
  4. Saves memory compared to UTF-16 for all Latin-1 strings. (See Python 3.3.)
  5. Saves memory compared to UTF-8 if there are many characters above 0x800 (such as CJK, Indic, etc.)

Disadvantages

  1. Adjusting the internal storage in a builder, or compacting for an immutable form, may be expensive.
  2. Low-level access (for performance) is limited: you don’t want to expose the internals.
  3. Most library software is optimized for UTF-16; interfacing with those libraries can involve conversions in and out.
  4. Emulation (such as emulating Dart with JavaScript) is cumbersome.
  5. One supplemental character causes the size to double, making it much worse than UTF-16.
  6. Worse than UTF-8 in memory if is at least one non-Latin-1 character.
  7. “Further from the metal”: less easy to optimize.

Emulation bears some explanation. When implementing the Uniform Model via a language or library that uses UTF-16 or UTF-8, then not just the string contents may need to be converted but also indexes, offsets and lengths. This generally requires walking through the string up to the code point index to find the corresponding code unit index, and vice versa. As an example, consider emulating myString[1,3] in JavaScript. This turns into something like the following:



int first = offsetByCodePoints(myString,0,1);
myString.substring(first, offsetByCodePoints(myString,first,3)).


where offsetByCodePoints takes a string, a code unit offset, and a code point count, and produces a code unit index. There are some techniques for dealing with this: see Optimizations.

UTF-16

The string class is indexed by code unit, and is UTF-16.
Used by: Java, JavaScript, C#, Windows, ICU (Java and C++), DOM,  iOS, OS X, JSON, Qt, KDE, WebKit, Chrome (internal) …



Indexing is by code unit, but there may be helper functions to help deal with code points. No 16 bit values are errors. Typically the string is formally a “Unicode 16-bit String” in memory, that is, it may contain “unpaired” surrogates. See Validation.

Advantages

  1. Good compromise between storage and complexity
  2. Random access by code unit is fast, constant time
  3. Interfacing with most software libraries can avoid conversions in and out
  4. “Close to the metal” (for text in UTF-16).
  5. Less complex than the UTF-8 model.

Disadvantages

  1. More complex for the programmer than the Uniform or UTF-32 models.
  2. Indexes may be into the “middle” of a character
  3. Random access by code point is relatively expensive*, involves scan of string or extra storage.

* However, this can be optimized (esp. for built/frozen strings) by caching the lowest index of any suppl. cp. Below that index, it's 1:1. See Optimizations.



Java indexing example:

int cp;
for (int i = 0; i < myString.length(); i += Character.charCount(cp)) {
 cp = myString.codePointAt(i);
 doSomethingWith(cp);
}



Some of this can be mitigated with semantic sugar. For example, a Java-like language could add a facility for iteration, and could add a datatype for codepoints, thus giving:



for (codepoint cp : myString) {
 doSomethingWith(cp);
}



It quickly turns back to ugly code, though, if the index is needed inside the loop. Iteration through sequences of bytes representing other encoding forms can be done with a fancy iterator.

UTF-8

The string class is indexed by code unit, and is UTF-8.
Used by: Gnome, Linux programs in general, Go, Google C++ generally, protobuf, …



Byte values C0-C1, F5-FF are illegal. Typically the string is formally a “Unicode 8-bit String” in memory, that is, it may contain ill-formed sequences.  See Validation.

Advantages

  1. Matches the encoding used by most web pages, XML documents.
  2. Random access by code unit is fast, constant time
  3. Interfacing with most software libraries can avoid conversions in and out
  4. “Close to the metal” (for text in UTF-8)

Disadvantages

  1. Most complex for the programmer
  2. Indexes may be into the “middle” of a character
  3. Random access by code point is relatively expensive, involves scan of string or extra storage.
  4. Most library software is optimized for UTF-16; interfacing with those libraries can involve conversions in and out.

UTF-32

The string class is indexed by code unit, and is UTF-32.
Used by: glibc?

The API looks like the Uniform Model, but the internals are always 32 bits.

Advantages

  1. Most of those for the Uniform Model.

Disadvantages

  1. Takes up a lot of storage: always 4 bytes, even for ASCII/Latin-1.

Validation

As remarked above, it is typically not necessary to have the contents of a Unicode string be a valid UTF form. It is often faster to avoid doing the extra work on every modification that would need to be done to keep the internals valid. Moreover, in many programming languages, the operations want to be “close to the metal”, allowing the String class to be a cover for a buffer of code units that may or may not be valid.


It is important, however, to be able to convert the string to a conformant UTF form for storage and communication with other processes that will be expecting it. 


In a Unicode 16-bit string, any ill-formed sequence will be an isolated surrogate value. The best approach when interpreting that value is to treat it as if it were a reserved code point, with the associated properties. It is thus not a part of a word, not alphabetic, etc. When converting to well-formed UTF-16, the common technique is change that illegal value to U+FFFD (you never want to just delete it; that has security problems: see Deletion of Code Points).


The combination of these two techniques provides a uniform approach: the length of the string doesn’t change under conversion, and when iterating through the string, you get the same number of code points at the same indices, as when you convert it. Having the same indices (code point and code unit) before and after conversion gives consistent results: you don’t shift S[3,5] to being a different part of the string.



[TBD: flesh out the situation for UTF-8, which is trickier.]

Distinguishing bytes from strings

An important feature of the UTF-8 string model is to have a different class for a UTF-8 string from those representing a‘bucket of bytes’. The latter represents some arbitrary sequence of bytes, such as a JPEG or bytes in another encoding. (Systems that use UTF-16 or the Uniform model are less likely to fall prey to this.) (protocol buffers distinguish the string type from the bytes type, for example)

Optimizations

There are a few key optimizations for a Uniform string model. First, make sure that there are constructs for quickly iterating through the code points in a string. This is not only for ease of use, but also so that these can be converted into the most efficient code when emulated.



for (codepoint cp : myString) {
 doSomethingWith(cp);
}


Because it is important to have high-performance interoperability with UTF-16 APIs, the conversion of strings to and from UTF-16 needs to be as fast as possible, and there need to be optimized methods for converting indexes. That is, suppose that you have a string and a position in that string. To call a UTF-16 API with that, you have to both convert the string to UTF-16 (if it isn’t already 16-bit) and convert the code point offset to the corresponding UTF-16 offset.



// returns the corresponding index in a UTF-16 string
int index16 = string.getIndex16From32(int index32);


// returns the index corresponding to a UTF-16 one.
int index32 = string.getIndex32From16(int index16);



To that purpose, it helps to have additional field in the String class:



// The index of the first code point > 0xFFFF in the string, or maxInt if there is none.

int indexOfFirstSupplementary = maxInt;



At this point and lower, the UTF-16 index is identical to the code point boundary. This allows getIndex16 and getIndexFromIndex16 to be constant time for the beginning (and usually all) of the string. Only if there is a supplemental character in the string, and only for indexes after that character, does a scan through characters need to be performed. (This is used in ICU strings, thanks to Markus Scherer.)

Hex Escapes

UTS #18, Unicode Regular Expressions has a table of escape mechanisms used in various programming languages and protocols. For new languages, a recommended approach for clarity and compactness is the one taken by UTS #18 and by Ruby:

\u{1F47D 20AC A3 61 9}

represents the sequence:

<U+1F47D U+20AC U+00A3 U+0061 U+0009>

So "I like \u{422 43E 43B 441 442 43E 301 439}." represents "I like Толсто́й".


For readability, I also recommend supporting syntax like \N{EURO SIGN} as in Perl and UTS #18.

Literals

If there is only one Unicode string type, no special notation is needed. Otherwise, some semantic flag is required. C++11 uses the following:



u8"This is a UTF-8 string."
u
"This is a UTF-16 string."
U
"This is a UTF-32 string."

Variables

If variables may contain non-ASCII characters, it’s recommended to follow the guidelines in UAX #31, Unicode Identifier and Pattern Syntax, with some additions or subtractions. The properties used for those guidelines are backwards compatible over versions of the Unicode Standard, so once a string qualifies as an identifier in one version, it does in all future versions.

The question of whether to allow non-ASCII characters in variables is open. If the programming language intended for widespread use, and is more “consumer” oriented, it may be useful to provide them. In practice, most multinational organizations tend to restrict identifiers to just ASCII anyway, because they are more likely to be readable by all people working on the code.

Loose Ends

[TBD: Discuss class structure: superclass has api in code points; subclasses have additional api in code units.]

[TBD: Mention Python performance measurements.]

4 comments :

  1. As part of the escapes/literal discussion, you may want to mention Python's \N{EURO SIGN}, which people find more legible than \u20ac.

    ReplyDelete
  2. In a programming language, it would be useful to know the maximum code point that might be in a string, for optimized algorithms working on strings. For example, lowercasing of ASCII strings. The maximum is easy when the string was built/frozen, for which it had to be inspected. At a minimum, it would be good to know that the maximum is 7f, ff, (maybe d7ff), ffff or 10ffff. An actual maximum might be handy too. (Someone might have a Latin fastpath that extends to 17f, or one for anything below any BiDi, below any CJK, etc.)

    Do Python and Dart allow (unpaired) surrogate code points? If so, it would be useful to have API for whether a string contains one.

    Internally, it would also be useful to know that an 8-bit string only contains ASCII characters (max=7f), because that is valid UTF-8 and can be handed directly to other libraries using UTF-8.

    ReplyDelete
  3. I think you've misplaced Dart in this list; it isn't standing alongside Python 3.3, it's more like JavaScript. At least, that's true according to http://api.dartlang.org/docs/releases/latest/dart_core/Runes.html which says that strings are UTF-16. In my opinion, the only viable options are UTF-32 and the Python 3.3 system, which is an optimized version of UTF-32. Everything else is going to cause problems. Incidentally, if you're concerned about the performance implications of the multi-representation system, check out Pike - it's another high level language (open source, too) and it's been using this kind of string for years. The optimization potential is actually huge... and all without affecting application code at all.

    @Markus: Python 3.3 does allow unpaired surrogates, though that may change at some point. Pike also does, but that's because its string type is the one that would be used for a UTF-16 string as well as a true Unicode one (and it's the same type that's used for Latin-1 strings and UTF-8 strings too). However, they are simply representing themselves. There's nothing that pairs them up into characters. They are simply codepoints that represent illegal characters.

    As to the max codepoint - Pike, as of recent versions, does record exactly that:

    Pike v7.9 release 11 running Hilfe v3.5 (Incremental Pike Frontend)
    > typeof("Hello, world!");
    (1) Result: string(32..119)

    I don't know of a way to query this in application code, but it also keeps track of whether ASCII-only strings include any upper-case or lower-case letters. Lower-casing an all-ASCII string with no upper-case letters in it is optimized down to returning the input unchanged, and ditto the converse:

    > string foo="test";
    > gauge {for (int i=0;i<10000000;++i) upper_case(foo);};
    (5) Result: 2.016181428
    > gauge {for (int i=0;i<10000000;++i) lower_case(foo);};
    (6) Result: 0.951476813
    > gauge {for (int i=0;i<10000000;++i) foo;};
    (7) Result: 0.680354775

    The savings get even greater with longer strings. Pike is highly optimized for string handling (with sscanf expressions being an extremely fast and readable tool that does a lot of what most people turn to regexps for) and networking, making it a great choice for running internet servers. It also happens to be my favorite language in all the world :)

    ReplyDelete