Friday, July 20, 2012


Unicode String Models

Many programming languages (and most modern software) have moved to Unicode model of text. Text coming into the system might be in legacy encodings like Shift-JIS or Latin-1, and text being pushed out the door might also be in legacy encoding, but those are done by explicit conversions at the entry and exit points. Everything else is processed internally in Unicode.

But what model of strings is best for Unicode? I’ll briefly discuss some of the options, and the advantages and disadvantages of each of the main contenders. 


This is still draft, and some sections remain to be fleshed out. Comments and corrections are welcome.

Background

Unicode characters are represented using integer values from 0 to 0x10FFFF, called code points. Not every code point is a character; for example, some are reserved, and get allocated with successive versions of Unicode.

The complication is that there are three different encoding forms for Unicode used to represent these code points. The differences are based on the size of the “code unit”, the basic unit of memory for encoding of that form. These are unsigned ints of width 8, 16, or 32 bits.


Encoding FormBits per Code UnitUnits per Code Point
UTF-8
8
1, 2, 3, or 4
UTF-16
16
1 or 2
UTF-32
32
1

I won’t go into the details of these formats, but the differences are crucial for the string model. The most important issue for APIs is indexing. Let’s look at what the string S = “カβx” looks like in each of the forms in the table below. The # columns indicate the string indexes in terms of code units for each character (the final one being the length).


Chars
#
#
#
β
#
x
#
UTF-8
0
F0 9F 91 BD
4
E3 82 AB
7
CE B2
9
78
10
UTF-16
0
D83D DC7D
2
30AB
3
03B2
4
0078
5
UTF-32
0
0001F47D
1
000030AB
2
000003B2
3
00000078
4

For simple iteration through all the code points in a string, the difference in internals doesn’t matter. For example, such an iteration can be used to see whether the string contains a Greek character.

But the internals matters a great deal for indexing, a common and important operation. We’ll use the notation S[n,m] to mean the substring of S containing code units n ≤ x < m. When indexing in terms of code units:

  • S[0,4] would be カβx”  in UTF-32, カβ” in UTF-16, but ”  in UTF-8
  • S[4,5] would be “x”  in UTF-16, would be an error in UTF-32 (extending beyond the end of the string), and be a different error in UTF-8 (⅓  the way into the character “カ”).
When accessing a native UTF-8 or UTF-16 string, an API could consistently use code point indices, but that costs performance. For example, to determine where code point index 3 is in the UTF-8 string representation of S, that requires scanning from the start of the string. Not a big deal for small strings, but a non-trivial expense for a 10K string, let alone a 10M string.

Frequency and Performance

One relevant point for performance and storage comparisons is that supplementary characters (those requiring 4 bytes in UTF-8/16) are extremely rare as a percentage of all text. They do occur, and are important to support, but have a very low frequency in typical text. That means that it is best to optimize for BMP characters (and as a subset, ASCII and Latin-1), and fall into a ‘slow path’ when a supplementary character is encountered.



Unicode String Models

Based on the indexing scheme, there are four core models for Unicode Strings. The same programming language or library could use or support multiple models, but typically one of them is the default. For example, UTF-16 is the one string model used in Java; C++ supports 3 different string models; ICU supports UTF16, but certain performance-critical APIs like collation also take UTF-8 strings—ICU optimizes some of its internal data structures to work well with both.

Uniform Model

The string class is indexed by code point.
Used by: Python 3.3, Dart, …



Externally, the string looks like it is UTF-32. To conserve space, the internal storage is a uniform array of 8, 16, or 32 bit unsigned integers.
  1. A string whose characters all fit in a byte uses 8-bit units. This covers Latin-1.
  2. A string that has at least one character needing 2 bytes uses 16-bit units. This covers UCS-2 = BMP.
  3. Only a string with a character needing 32 bits needs to expand the array to be 32-bit units.
  4. Values over 0x10FFFF are an error (see Validation)

There are different ways to optimize this model, balancing storage costs against the costs of recopying into a larger (or smaller) unit size. Combined with a Builder or Freeze model, it works particularly well; the mutable builder or unfrozen string uses 32 bit units; only when the immutable String or frozen string is produced is the storage optimized.



Indexing example:



for (int i = 0; i < myString.length(); ++i) {
 codepoint cp = myString[i];
 doSomethingWith(cp);
}

Advantages

  1. Simplest for the programmer
  2. Indexes correspond to Unicode code point boundaries
    1. S[1,3] = “βカ”, no matter what the internal storage
    2. you are never “in the middle” of a character
  3. Random access by code point is fast, constant time
  4. Saves memory compared to UTF-16 for all Latin-1 strings. (See Python 3.3.)
  5. Saves memory compared to UTF-8 if there are many characters above 0x800 (such as CJK, Indic, etc.)

Disadvantages

  1. Adjusting the internal storage in a builder, or compacting for an immutable form, may be expensive.
  2. Low-level access (for performance) is limited: you don’t want to expose the internals.
  3. Most library software is optimized for UTF-16; interfacing with those libraries can involve conversions in and out.
  4. Emulation (such as emulating Dart with JavaScript) is cumbersome.
  5. One supplemental character causes the size to double, making it much worse than UTF-16.
  6. Worse than UTF-8 in memory if is at least one non-Latin-1 character.
  7. “Further from the metal”: less easy to optimize.

Emulation bears some explanation. When implementing the Uniform Model via a language or library that uses UTF-16 or UTF-8, then not just the string contents may need to be converted but also indexes, offsets and lengths. This generally requires walking through the string up to the code point index to find the corresponding code unit index, and vice versa. As an example, consider emulating myString[1,3] in JavaScript. This turns into something like the following:



int first = offsetByCodePoints(myString,0,1);
myString.substring(first, offsetByCodePoints(myString,first,3)).


where offsetByCodePoints takes a string, a code unit offset, and a code point count, and produces a code unit index. There are some techniques for dealing with this: see Optimizations.

UTF-16

The string class is indexed by code unit, and is UTF-16.
Used by: Java, JavaScript, C#, Windows, ICU (Java and C++), DOM,  iOS, OS X, JSON, Qt, KDE, WebKit, Chrome (internal) …



Indexing is by code unit, but there may be helper functions to help deal with code points. No 16 bit values are errors. Typically the string is formally a “Unicode 16-bit String” in memory, that is, it may contain “unpaired” surrogates. See Validation.

Advantages

  1. Good compromise between storage and complexity
  2. Random access by code unit is fast, constant time
  3. Interfacing with most software libraries can avoid conversions in and out
  4. “Close to the metal” (for text in UTF-16).
  5. Less complex than the UTF-8 model.

Disadvantages

  1. More complex for the programmer than the Uniform or UTF-32 models.
  2. Indexes may be into the “middle” of a character
  3. Random access by code point is relatively expensive*, involves scan of string or extra storage.

* However, this can be optimized (esp. for built/frozen strings) by caching the lowest index of any suppl. cp. Below that index, it's 1:1. See Optimizations.



Java indexing example:

int cp;
for (int i = 0; i < myString.length(); i += Character.charCount(cp)) {
 cp = myString.codePointAt(i);
 doSomethingWith(cp);
}



Some of this can be mitigated with semantic sugar. For example, a Java-like language could add a facility for iteration, and could add a datatype for codepoints, thus giving:



for (codepoint cp : myString) {
 doSomethingWith(cp);
}



It quickly turns back to ugly code, though, if the index is needed inside the loop. Iteration through sequences of bytes representing other encoding forms can be done with a fancy iterator.

UTF-8

The string class is indexed by code unit, and is UTF-8.
Used by: Gnome, Linux programs in general, Go, Google C++ generally, protobuf, …



Byte values C0-C1, F5-FF are illegal. Typically the string is formally a “Unicode 8-bit String” in memory, that is, it may contain ill-formed sequences.  See Validation.

Advantages

  1. Matches the encoding used by most web pages, XML documents.
  2. Random access by code unit is fast, constant time
  3. Interfacing with most software libraries can avoid conversions in and out
  4. “Close to the metal” (for text in UTF-8)

Disadvantages

  1. Most complex for the programmer
  2. Indexes may be into the “middle” of a character
  3. Random access by code point is relatively expensive, involves scan of string or extra storage.
  4. Most library software is optimized for UTF-16; interfacing with those libraries can involve conversions in and out.

UTF-32

The string class is indexed by code unit, and is UTF-32.
Used by: glibc?

The API looks like the Uniform Model, but the internals are always 32 bits.

Advantages

  1. Most of those for the Uniform Model.

Disadvantages

  1. Takes up a lot of storage: always 4 bytes, even for ASCII/Latin-1.

Validation

As remarked above, it is typically not necessary to have the contents of a Unicode string be a valid UTF form. It is often faster to avoid doing the extra work on every modification that would need to be done to keep the internals valid. Moreover, in many programming languages, the operations want to be “close to the metal”, allowing the String class to be a cover for a buffer of code units that may or may not be valid.


It is important, however, to be able to convert the string to a conformant UTF form for storage and communication with other processes that will be expecting it. 


In a Unicode 16-bit string, any ill-formed sequence will be an isolated surrogate value. The best approach when interpreting that value is to treat it as if it were a reserved code point, with the associated properties. It is thus not a part of a word, not alphabetic, etc. When converting to well-formed UTF-16, the common technique is change that illegal value to U+FFFD (you never want to just delete it; that has security problems: see Deletion of Code Points).


The combination of these two techniques provides a uniform approach: the length of the string doesn’t change under conversion, and when iterating through the string, you get the same number of code points at the same indices, as when you convert it. Having the same indices (code point and code unit) before and after conversion gives consistent results: you don’t shift S[3,5] to being a different part of the string.



[TBD: flesh out the situation for UTF-8, which is trickier.]

Distinguishing bytes from strings

An important feature of the UTF-8 string model is to have a different class for a UTF-8 string from those representing a‘bucket of bytes’. The latter represents some arbitrary sequence of bytes, such as a JPEG or bytes in another encoding. (Systems that use UTF-16 or the Uniform model are less likely to fall prey to this.) (protocol buffers distinguish the string type from the bytes type, for example)

Optimizations

There are a few key optimizations for a Uniform string model. First, make sure that there are constructs for quickly iterating through the code points in a string. This is not only for ease of use, but also so that these can be converted into the most efficient code when emulated.



for (codepoint cp : myString) {
 doSomethingWith(cp);
}


Because it is important to have high-performance interoperability with UTF-16 APIs, the conversion of strings to and from UTF-16 needs to be as fast as possible, and there need to be optimized methods for converting indexes. That is, suppose that you have a string and a position in that string. To call a UTF-16 API with that, you have to both convert the string to UTF-16 (if it isn’t already 16-bit) and convert the code point offset to the corresponding UTF-16 offset.



// returns the corresponding index in a UTF-16 string
int index16 = string.getIndex16From32(int index32);


// returns the index corresponding to a UTF-16 one.
int index32 = string.getIndex32From16(int index16);



To that purpose, it helps to have additional field in the String class:



// The index of the first code point > 0xFFFF in the string, or maxInt if there is none.

int indexOfFirstSupplementary = maxInt;



At this point and lower, the UTF-16 index is identical to the code point boundary. This allows getIndex16 and getIndexFromIndex16 to be constant time for the beginning (and usually all) of the string. Only if there is a supplemental character in the string, and only for indexes after that character, does a scan through characters need to be performed. (This is used in ICU strings, thanks to Markus Scherer.)

Hex Escapes

UTS #18, Unicode Regular Expressions has a table of escape mechanisms used in various programming languages and protocols. For new languages, a recommended approach for clarity and compactness is the one taken by UTS #18 and by Ruby:

\u{1F47D 20AC A3 61 9}

represents the sequence:

<U+1F47D U+20AC U+00A3 U+0061 U+0009>

So "I like \u{422 43E 43B 441 442 43E 301 439}." represents "I like Толсто́й".


For readability, I also recommend supporting syntax like \N{EURO SIGN} as in Perl and UTS #18.

Literals

If there is only one Unicode string type, no special notation is needed. Otherwise, some semantic flag is required. C++11 uses the following:



u8"This is a UTF-8 string."
u
"This is a UTF-16 string."
U
"This is a UTF-32 string."

Variables

If variables may contain non-ASCII characters, it’s recommended to follow the guidelines in UAX #31, Unicode Identifier and Pattern Syntax, with some additions or subtractions. The properties used for those guidelines are backwards compatible over versions of the Unicode Standard, so once a string qualifies as an identifier in one version, it does in all future versions.

The question of whether to allow non-ASCII characters in variables is open. If the programming language intended for widespread use, and is more “consumer” oriented, it may be useful to provide them. In practice, most multinational organizations tend to restrict identifiers to just ASCII anyway, because they are more likely to be readable by all people working on the code.

Loose Ends

[TBD: Discuss class structure: superclass has api in code points; subclasses have additional api in code units.]

[TBD: Mention Python performance measurements.]