Strings

There are two kinds of strings: binary strings and UTF-8 (human-readable) strings. Both have the most significant bit of the tag set. The seventh bit tells the kind: set for a UTF-8 string, clear for a binary one. The remaining six bits contain the length value of the string.

       len
     +------+
    /        \
1 U L L L L L L
  ^
  +-is it UTF-8?

The six-bit length value is interpreted as follows:

0-60

The value itself is the length.

61

The length is 61 plus the next 8-bit value.

62

The length is 62 plus 255 plus the next big-endian 16-bit value.

63

The length is 63 plus 255 plus 65535 plus the next big-endian 64-bit value.

The string's length must be encoded in the shortest possible form.
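The length rules above can be sketched as a small encoder. This is only an illustration, not a reference implementation: the function name `encode_string` and its `utf8` parameter are hypothetical. Because each extended form adds the maximum of the previous forms as an offset, the ranges do not overlap (0-60, 61-316, 317-65852, 65853 and up), so picking the first range that fits should give the shortest form.

```python
def encode_string(data: bytes, utf8: bool = False) -> bytes:
    """Encode a string as tag byte(s) plus payload (hypothetical helper)."""
    # Most significant bit marks a string; the next bit marks UTF-8.
    tag = 0x80 | (0x40 if utf8 else 0x00)
    n = len(data)
    if n <= 60:
        # Short form: the six low bits hold the length directly.
        head = bytes([tag | n])
    elif n <= 61 + 0xFF:
        # Length value 61: one extra byte holds (n - 61).
        head = bytes([tag | 61, n - 61])
    elif n <= 62 + 0xFF + 0xFFFF:
        # Length value 62: big-endian 16-bit value holds (n - 62 - 255).
        head = bytes([tag | 62]) + (n - 62 - 0xFF).to_bytes(2, "big")
    else:
        # Length value 63: big-endian 64-bit value holds the remainder.
        head = bytes([tag | 63]) + (n - 63 - 0xFF - 0xFFFF).to_bytes(8, "big")
    return head + data
```

For instance, a 64-byte payload falls in the second range, so the header is the tag with length value 61 followed by a single byte holding 3.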

UTF-8 strings must be valid UTF-8 sequences, except that the null byte is not allowed. The string should be a normalized Unicode string.
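A payload check along these lines might look as follows. It is a sketch under assumptions: the helper name `check_utf8_payload` is hypothetical, and the text does not say which normalization form is meant, so NFC is used here purely as an illustrative default. Since normalization is only a "should", the sketch reports it instead of rejecting the string.

```python
import unicodedata


def check_utf8_payload(data: bytes, form: str = "NFC"):
    """Return (text, is_normalized); raise ValueError on an invalid payload."""
    # Strict decoding rejects malformed UTF-8 (UnicodeDecodeError is a ValueError).
    text = data.decode("utf-8")
    if "\x00" in text:
        raise ValueError("null byte is not allowed in UTF-8 strings")
    # Assumption: NFC is the normalization form; the source does not specify one.
    return text, unicodedata.is_normalized(form, text)
```

`unicodedata.is_normalized` requires Python 3.8 or later.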

Example representations:

0-byte binary string                                  80
4-byte binary string 0x01 0x02 0x03 0x04              84 01 02 03 04
64-byte binary string filled with 0x41                BD 03 41 41 .. 41
UTF-8 string "привет мир" ("hello world" in Russian)  D3 D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82 20 D0 BC D0 B8 D1 80
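The example representations can be checked with a minimal decoder sketch. The names `EXTENDED` and `decode_string` are hypothetical, and the sketch only parses the tag, length and payload; it does not validate UTF-8 content.

```python
# Length value -> (width in bytes of the extra field, offset added to it).
EXTENDED = {61: (1, 61), 62: (2, 62 + 255), 63: (8, 63 + 255 + 65535)}


def decode_string(buf: bytes):
    """Return (is_utf8, payload, bytes_consumed) for a string at buf[0]."""
    if not buf or not (buf[0] & 0x80):
        raise ValueError("not a string tag")
    is_utf8 = bool(buf[0] & 0x40)   # seventh bit: UTF-8 or binary
    length = buf[0] & 0x3F          # low six bits: length value
    pos = 1
    if length in EXTENDED:
        width, base = EXTENDED[length]
        if len(buf) < pos + width:
            raise ValueError("truncated length field")
        length = base + int.from_bytes(buf[pos:pos + width], "big")
        pos += width
    if len(buf) < pos + length:
        raise ValueError("truncated payload")
    return is_utf8, buf[pos:pos + length], pos + length


# The 64-byte example: tag BD (length value 61), extra byte 03, then the data.
flag, payload, used = decode_string(bytes.fromhex("bd03") + b"\x41" * 64)
```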