Strings (KEKS)

Strings

There are two kinds of strings: binary and UTF-8 (human-readable). The most significant bit of the tag is set for both. The next bit (the seventh, value 0x40) tells whether the string is UTF-8; if it is clear, the string is binary. The lowest six bits contain the length of the string.

       len
     +------+
    /        \
1 U L L L L L L
  ^
  +-is it UTF-8?
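
The bit layout above can be read with two masks. The following is a minimal sketch, not a reference implementation; the function name `parse_string_tag` is hypothetical and chosen only for illustration.

```python
def parse_string_tag(tag: int) -> tuple[bool, int]:
    """Split a string tag byte into (is_utf8, length_field).

    Hypothetical helper: the name and return shape are assumptions,
    only the bit positions come from the layout above.
    """
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must be a single byte")
    if tag & 0x80 == 0:
        raise ValueError("not a string tag: most significant bit is clear")
    is_utf8 = bool(tag & 0x40)   # the U bit
    length_field = tag & 0x3F    # the six L bits
    return is_utf8, length_field
```

For instance, the tag `D3` splits into a set U bit and a length field of 19.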

The value of the length field is interpreted as follows:

0-60: Use as is.
61: 61 plus the next 8-bit value.
62: 62 plus 255 plus the next big-endian 16-bit value.
63: 63 plus 255 plus 65535 plus the next big-endian 64-bit value.

The string's length must be encoded in the shortest possible form.
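
The length rules can be sketched as an encoder that picks the first form whose range covers the length. Because each longer form adds the maximum of the shorter ones as an offset, the ranges (0-60, 61-316, 317-65852, 65853 and up) do not overlap. The function name `encode_string` and its `utf8` parameter are hypothetical, and the caller is assumed to pass already validated UTF-8 bytes when `utf8` is true.

```python
def encode_string(data: bytes, utf8: bool = False) -> bytes:
    """Prefix data with a string tag and length (illustrative sketch)."""
    tag = 0x80 | (0x40 if utf8 else 0x00)
    n = len(data)
    if n <= 60:
        # Length fits directly into the six low bits.
        head = bytes([tag | n])
    elif n <= 61 + 0xFF:
        # 61 plus one extra byte.
        head = bytes([tag | 61, n - 61])
    elif n <= 62 + 0xFF + 0xFFFF:
        # 62 plus 255 plus a big-endian 16-bit value.
        head = bytes([tag | 62]) + (n - 62 - 0xFF).to_bytes(2, "big")
    else:
        # 63 plus 255 plus 65535 plus a big-endian 64-bit value.
        head = bytes([tag | 63]) + (n - 63 - 0xFF - 0xFFFF).to_bytes(8, "big")
    return head + data
```

As a usage example, `encode_string(b"A" * 64)` starts with the bytes `BD 03`, since 64 = 61 + 3.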

UTF-8 strings must be valid UTF-8 sequences and must not contain the null byte. They should be normalized Unicode strings.

Example representations:

0-byte binary string	`80`
4-byte binary string `0x01 0x02 0x03 0x04`	`84 01 02 03 04`
64-byte binary string filled with 0x41	`BD 03 41 41 .. 41`
UTF-8 string "привет мир" ("hello world" in Russian)	`D3 D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82 20 D0 BC D0 B8 D1 80`
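
The example representations above can be checked with a small decoder. This is a sketch under stated assumptions: the name `decode_string` and its return shape (value plus number of bytes consumed) are hypothetical, and Unicode normalization is not checked.

```python
def decode_string(buf: bytes):
    """Decode one string from the start of buf (illustrative sketch).

    Returns (value, consumed); value is str for UTF-8 strings, bytes otherwise.
    """
    if not buf or buf[0] & 0x80 == 0:
        raise ValueError("not a string tag")
    is_utf8 = bool(buf[0] & 0x40)
    field = buf[0] & 0x3F
    # Map the length field to (number of extra length bytes, offset to add).
    forms = {61: (1, 61), 62: (2, 62 + 255), 63: (8, 63 + 255 + 65535)}
    extra, offset = forms.get(field, (0, field))
    start = 1 + extra
    if len(buf) < start:
        raise ValueError("truncated length")
    # int.from_bytes on an empty slice yields 0, covering the 0-60 case.
    length = offset + int.from_bytes(buf[1:start], "big")
    payload = buf[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if not is_utf8:
        return payload, start + length
    if 0 in payload:
        raise ValueError("null byte in UTF-8 string")
    # bytes.decode raises UnicodeDecodeError (a ValueError) on invalid UTF-8.
    return payload.decode("utf-8"), start + length
```

Decoding `84 01 02 03 04` with this sketch yields the four payload bytes and a consumed count of 5.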