J8 Notation - Fixing the JSON-Unix Mismatch

J8 Notation is a set of text interchange formats. It's a syntax for:

  1. strings / bytes
  2. tree-shaped records (like JSON)
  3. line-based streams (like Unix)
  4. tables (like TSV)

It's part of the Oils project, and is intended to solve the JSON-Unix Mismatch: the Unix kernel deals with bytes, while JSON deals with Unicode strings (plus UTF-16 errors).

It's backward compatible with JSON, and built on top of it.

But just like JSON isn't only for JavaScript, J8 Notation isn't only for Oils. Any language that understands JSON should also understand J8 Notation.

(Note: J8 replaced the similar QSN design in January 2024. QSN wasn't as compatible with both JSON and YSH code.)

Table of Contents
Quick Picture
Goals
Reference
TODO / Diagrams
J8 Strings - Unicode and bytes
Review of JSON strings
J8 Description
What's representable by each style?
Asymmetry of Encoders and Decoders
YSH has 2 of the 3 styles
J8 Strings vs. POSIX Shell Strings
JSON8 - Tree-Shaped Records
Review of JSON
JSON8 Description
J8 Lines - Lines of Text
TSV8 - Table-Shaped Text
Review of TSV
TSV8 Description
Design Notes
Summary
Appendix
Related Links
Future Work
FAQ
Why are byte escapes spelled \yff, and not \xff as in C?
Why have both u'' and b'' strings, if only b'' is technically needed?
Why not use double quotes like u"" and b""?
How do I write a J8 encoder and decoder?
Should a J8 number be mapped to an Int, Float, or Decimal type?
Glossary

Quick Picture

There are 3 styles of J8 strings:

 "hi 🙂 \uD83D\uDE42"      # JSON-style, with surrogate pair

b'hi 🙂 \yF0\y9F\y99\y82'  # Can be ANY bytes, including UTF-8

u'hi 🙂 \u{1F642}'         # nice alternative syntax

They all denote the same decoded string — "hi" and two U+1F642 smiley faces:

hi 🙂 🙂
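
As a rough cross-check outside of Oils, the same equivalence can be sketched in Python. This uses only the standard json module; the variable names are illustrative and not part of J8 Notation.

```python
import json

# JSON-style: the surrogate pair \uD83D\uDE42 decodes to the code point U+1F642
from_json = json.loads('"hi 🙂 \\uD83D\\uDE42"')

# b'' style: \yF0\y9F\y99\y82 are the 4 UTF-8 bytes of U+1F642
from_bytes = "hi 🙂 ".encode("utf-8") + bytes([0xF0, 0x9F, 0x99, 0x82])

# u'' style: \u{1F642} is a code point escape
from_code_point = "hi 🙂 " + chr(0x1F642)

# All three spellings denote the same decoded string
assert from_json == from_code_point == from_bytes.decode("utf-8")
```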

Why did we add these u'' and b'' strings?


Now, starting with J8 strings, we define the formats JSON8:

{ name: "Alice",
  signature: b'\y01 ... \yff',  # binary data
}

J8 Lines:

  doc/hello.md
 "doc/with spaces.md"
b'doc/with byte \yff.md'

and TSV8:

!tsv8   size    name
!type   Int     Str
        42        doc/hello.md
        55       "doc/with spaces.md"
        99      b'doc/with byte \yff.md'

Together, these are called J8 Notation.

(JSON8 and TSV8 are still to be fully implemented in Oils.)

Goals

  1. Fix the JSON-Unix mismatch: all text formats should be able to express byte strings.
  2. Provide an option to avoid the surrogate pair / UTF-16 legacy of JSON.
  3. Allow expressing metadata about strings vs. bytes.
  4. Turn TSV into an exterior data frame format.

Non-goals:

  1. "Replace" JSON. JSON8 is backward compatible with JSON, and sometimes the lossy encoding is OK.
  2. Resolve the strings vs. bytes dilemma in all situations.

Reference

See the Data Notation Table of Contents in the Oils Reference.

TODO / Diagrams

J8 Strings - Unicode and bytes

Let's review JSON strings, and then describe J8 strings.

Review of JSON strings

JSON strings are enclosed in double quotes, and may have these escape sequences:

\"   \\   \/
\b   \f   \n   \r   \t
\u1234

Properties of JSON:

J8 Description

There are 3 styles of J8 strings:

  1. JSON strings j"", which may be written ""
  2. b'' strings
  3. u'' strings

b'' strings have these escapes:

\yff                # byte escape
\u{1f926}           # code point escape.  UTF-16 escapes like \u1234
                    # are ILLEGAL
\'                  # single quote, in addition to \"
\"  \\  \/          # same as JSON
\b  \f  \n  \r  \t  

(JSON-style double-quoted strings do not add the \' escape. Except for the optional j prefix, they remain the same.)

Examples:

b''
b'hello'
b'\\'
b'"double" \'single\''
b'nul byte \y00, unicode \u{1f642}'

u'' strings have all the same escapes, except \yff. This implies that they're always valid Unicode strings. (If JSON-style \u1234 escapes were allowed, they wouldn't be, because a lone surrogate half like \ud83d could be written.)

Examples:

u''
u'hello'
u'unicode string \u{1f642}' 

A string without a prefix, like 'foo', is equivalent to u'foo':

 'this is a u string'  # discouraged, unless the context is clear

u'this is a u string'  # better to be explicit
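
To make the escape rules concrete, here is a minimal decoder sketch in Python. It is not the Oils implementation: the function name decode_j8_string is hypothetical, it handles only the single-quoted styles, it returns the denoted bytes, and rejecting surrogate code points inside \u{...} is an assumption based on the statement that u'' strings are always valid Unicode.

```python
import re

# One-character escapes shared by b'' and u'' strings
_SIMPLE = {"'": "'", '"': '"', "\\": "\\", "/": "/",
           "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}

_ESCAPE = re.compile(r"""
    \\ (?: y ([0-9a-fA-F]{2})          # byte escape
         | u \{ ([0-9a-fA-F]{1,6}) \}  # code point escape
         | (.)                         # one-character escape
       )
""", re.VERBOSE | re.DOTALL)


def decode_j8_string(s: str) -> bytes:
    """Decode b'...', u'...' or '...' (hypothetical helper, a sketch only)."""
    if s.startswith("b'"):
        allow_bytes, body_start = True, 2
    elif s.startswith("u'"):
        allow_bytes, body_start = False, 2
    elif s.startswith("'"):
        allow_bytes, body_start = False, 1  # no prefix means u''
    else:
        raise ValueError("expected b'', u'' or ''")
    if len(s) < body_start + 1 or not s.endswith("'"):
        raise ValueError("missing closing quote")
    body = s[body_start:-1]

    out = bytearray()
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch == "\\":
            m = _ESCAPE.match(body, pos)
            if m is None:
                raise ValueError("incomplete escape")
            hex_byte, hex_code_point, simple = m.groups()
            if hex_byte is not None:
                if not allow_bytes:
                    raise ValueError("byte escapes are only allowed in b''")
                out.append(int(hex_byte, 16))
            elif hex_code_point is not None:
                cp = int(hex_code_point, 16)
                # Assumption: surrogates and out-of-range values are errors
                if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                    raise ValueError("invalid code point")
                out += chr(cp).encode("utf-8")
            elif simple in _SIMPLE:
                out += _SIMPLE[simple].encode("utf-8")
            else:
                # Includes JSON-style \u1234, which is illegal here
                raise ValueError("illegal escape")
            pos = m.end()
        elif ch == "'" or ord(ch) < 0x20:
            raise ValueError("unescaped quote or control character")
        else:
            out += ch.encode("utf-8")
            pos += 1
    return bytes(out)
```

Returning bytes for every style keeps the sketch small. A language with a string/bytes distinction might instead return a text type for u'' strings.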

What's representable by each style?

These relationships might help you understand the 3 styles of strings:

Strings representable by u''
= All Unicode Strings (no more and no less)

Strings representable by "" (JSON-style)
= All Unicode Strings ∪ Surrogate Half Errors

Strings representable by b''
= All Byte Strings
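
These set relationships can be observed from Python, as a small illustration. It relies on the fact that Python's json module accepts a lone surrogate escape; other JSON libraries may behave differently.

```python
import json

# A lone surrogate half is representable in a JSON-style "" string ...
half = json.loads('"\\ud83d"')
assert len(half) == 1

# ... but the result is not a valid Unicode string: it has no UTF-8 encoding.
try:
    half.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable

# The byte 0xff never occurs in valid UTF-8, so this byte string is
# representable by b'' but by neither u'' nor "".
try:
    b"\xff".decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False
assert not is_utf8
```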

Examples:

Asymmetry of Encoders and Decoders

A few things to notice about J8 encoders:

  1. They may choose to emit only "" strings, possibly substituting the Unicode replacement char U+FFFD for invalid bytes. Such an encoder is a strict JSON encoder.
  2. They must emit b'' strings to preserve all information, because U+FFFD replacement is lossy.
  3. They never need to emit u'' strings.

On the other hand, J8 decoders must accept all 3 kinds of strings.
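
The encoder choices above can be sketched as follows. This is an illustration, not the Oils encoder: the names encode_j8 and _encode_b and the lossy flag are hypothetical, and writing control characters as \y byte escapes is just one valid option.

```python
import json


def _encode_b(data: bytes) -> str:
    """Write bytes as a b'' string, keeping valid UTF-8 runs readable."""
    parts = ["b'"]
    # surrogateescape maps each invalid byte 0x80..0xFF to U+DC80..U+DCFF
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:       # a byte that was not valid UTF-8
            parts.append("\\y%02x" % (cp - 0xDC00))
        elif ch in ("'", "\\"):
            parts.append("\\" + ch)
        elif cp < 0x20:                  # ASCII control character
            parts.append("\\y%02x" % cp)
        else:
            parts.append(ch)
    parts.append("'")
    return "".join(parts)


def encode_j8(data: bytes, lossy: bool = False) -> str:
    """Emit a "" string when possible, else b'' (or lossy JSON if asked)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        if lossy:
            # Strict JSON output: invalid bytes become U+FFFD
            replaced = data.decode("utf-8", errors="replace")
            return json.dumps(replaced, ensure_ascii=False)
        return _encode_b(data)           # preserves every byte
    return json.dumps(text, ensure_ascii=False)
```

Note that this sketch never emits a u'' string, which matches point 3 above.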

YSH has 2 of the 3 styles

A nice property of YSH is that the u'' and b'' strings are valid code:

echo u'hi \u{1f642}'  # u respected in YSH, but not OSH

var myBytes = b'\yff\yfe'

This is useful for correct code generation, and simplifies the language.

But JSON-style strings aren't valid in YSH. The two usages of double quotes can't really be reconciled, because JSON looks like "line\n" and shell looks like "x = ${myvar}".

J8 Strings vs. POSIX Shell Strings

When the encoded form of a J8 string doesn't contain a backslash, it's identical to a POSIX shell string.

In this case, it can make sense to omit the u'' prefix. Example:

shell_string='hi 🙂'

var ysh_str = u'hi 🙂'

var ysh_str =  'hi 🙂'  # same thing

An encoded J8 string has no backslashes when the original string has all these properties:

  1. Valid Unicode (no non-UTF-8 bytes).
  2. No ASCII control characters. All bytes are 0x20 and greater.
  3. No backslashes or single quotes. (All other required escapes are control characters.)
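
The three properties translate directly into a check. A minimal sketch in Python, where the function name is hypothetical:

```python
def encodes_without_backslash(data: bytes) -> bool:
    """True if the J8 encoding of data needs no backslash escapes."""
    # 1. Valid Unicode (no non-UTF-8 bytes)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # 2. No ASCII control characters: all bytes are 0x20 and greater
    if any(byte < 0x20 for byte in data):
        return False
    # 3. No backslashes or single quotes
    return b"\\" not in data and b"'" not in data
```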

JSON8 - Tree-Shaped Records

Now that we've defined J8 strings, we can define JSON8, an obvious extension of JSON.

(Not implemented yet.)

Review of JSON

See https://json.org

[primitive]     null   true   false
[number]        42  -1.2e-4
[string]        "hello\n"
[array]         [1, 2, 3]
[object]        {"key": 42}

JSON8 Description

JSON8 is like JSON, but:

  1. All strings can be J8 strings, in any of the 3 styles described above.
  2. Object/Dict keys may be unquoted, like {age: 42}
  3. Trailing commas are allowed on objects and arrays: {"d": 42,} and [42,]
  4. End-of-line comments. We use # to be consistent with shell.

Example:

{ name: "Bob",  # comment
  age: 30,
  sig: b'\y00\y01 ... \yff',  # trailing comma, binary data
}

J8 Lines - Lines of Text

J8 Lines is another format built on J8 strings. Each line is either:

  1. An unquoted string, which must be valid UTF-8. Whitespace is allowed, but not other ASCII control chars.
  2. A quoted J8 string (JSON style "" or J8-style b'' u'')
  3. An ignored empty line

In all cases, leading and trailing whitespace is ignored.


For example, 6 strings with weird characters could be represented like this:

  dir/with spaces.txt       # unquoted string must be UTF-8
 "dir/with newline \n.txt"  # JSON-style 
b'dir/with bytes \yff.txt'  # J8-style
u'dir/unicode \u{3bc}'
                            # ignored empty line
 ''                         # empty string, not ignored
 'dir/unicode \u{3bc}'      # no prefix implies u''

Note that J8 strings always occupy one physical line, because they can't contain unescaped control characters, including newlines.
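
Because each string occupies one physical line, a reader can process the input line by line. The following Python sketch classifies a single line; the function name and the returned tags are hypothetical, whitespace is assumed to mean space, tab, and line endings, and b'' / u'' payloads are returned still encoded rather than decoded.

```python
import json


def classify_j8_line(line: str):
    """Return (kind, payload) for one line of J8 Lines input (a sketch)."""
    # Leading and trailing whitespace is ignored in all cases
    stripped = line.strip(" \t\r\n")
    if stripped == "":
        return ("empty", None)                    # ignored empty line
    if stripped.startswith('"'):
        return ("json", json.loads(stripped))     # JSON-style string
    for prefix, kind in (("b'", "b"), ("u'", "u"), ("'", "u")):
        if stripped.startswith(prefix):
            return (kind, stripped)               # J8-style, still encoded
    # Unquoted: whitespace is allowed, other ASCII control chars are not
    if any(ord(ch) < 0x20 and ch != "\t" for ch in stripped):
        raise ValueError("control character in unquoted line")
    return ("unquoted", stripped)
```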

J8 Lines can be viewed as a simpler case of TSV8, described in the next section.

Related

TSV8 - Table-Shaped Text

Let's review TSV, and then describe TSV8.

Review of TSV

TSV has a very short specification:

Example:

name<TAB>age
alice<TAB>44
bob<TAB>33

Limitations:

TSV8 Description

TSV8 is like TSV with:

  1. A !tsv8 prefix and required column names.
  2. An optional !type line, with types Bool Int Float Str.
  3. Other optional column attributes.
  4. Rows of data, each starting with an empty "gutter" column.

Example:

!tsv8   age     name    
!type   Int     Str     # optional types
!other  x       y       # more column metadata
        44        alice
        33        bob
         1       "a\tb"
         2      b'nul \y00'
         3      u'unicode \u{3bc}'

Types:

[Bool]      false   true
[Int]       JSON numbers, restricted to [0-9]+
[Float]     same as JSON
[Str]       J8 string (any of the 3 styles)

Rules for cells:

  1. They can be any of 4 forms in J8 Lines:
    1. Unquoted
    2. JSON-style ""
    3. u''
    4. b''
  2. Leading and trailing whitespace must be stripped, as in J8 Lines.
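
As a rough sketch of how a reader might split a TSV8 document, assuming the layout shown in the example: the function name parse_tsv8 is hypothetical, attribute lines other than !type are skipped, and cells are returned as raw text without decoding quoted strings or applying types. Splitting on tab characters is safe because J8 strings can't contain unescaped control characters.

```python
def parse_tsv8(text: str):
    """Return (column names, column types or None, rows of raw cells)."""
    names, types, rows = None, None, []
    for line in text.split("\n"):
        if line.strip() == "":
            continue
        # Strip the padding spaces around each cell
        cells = [cell.strip(" ") for cell in line.split("\t")]
        first, rest = cells[0], cells[1:]
        if first == "!tsv8":
            names = rest                  # required column names
        elif first == "!type":
            types = rest                  # optional column types
        elif first.startswith("!"):
            pass                          # other attributes: skipped here
        elif first == "":
            # Data row: the first column is the empty "gutter"
            if names is None:
                raise ValueError("data row before !tsv8 header")
            if len(rest) != len(names):
                raise ValueError("wrong number of cells in row")
            rows.append(rest)
        else:
            raise ValueError("data rows must start with an empty gutter")
    if names is None:
        raise ValueError("missing !tsv8 header")
    return names, types, rows
```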

TODO: What about empty cells? Are they equivalent to null? TSV apparently can't have empty cells, as the rule is [character]+, not [character]*.

Column attributes:

Design Notes

TODO: This section will be filled in as we implement TSV8.

Summary

This document described an upgrade of JSON strings:

And data formats built on top of these strings:

Appendix

Related Links

Future Work

We could have an SEXP8 format for:

FAQ

Why are byte escapes spelled \yff, and not \xff as in C?

Because in JavaScript and Python, \xff is a code point, not a byte. That is, it's a synonym for \u00ff, which is encoded in UTF-8 as the 2 bytes 0xc3 0xbf.

This is exactly the confusion we want to avoid, so \yff is explicitly different.
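
The difference is easy to check in Python:

```python
# In a Python text string, "\xff" is the code point U+00FF ...
code_point = "\xff"
assert code_point == "\u00ff"

# ... which UTF-8 encodes as the 2 bytes 0xc3 0xbf
assert code_point.encode("utf-8") == b"\xc3\xbf"

# A single byte 0xff is a different value
single_byte = bytes([0xFF])
assert len(single_byte) == 1
assert single_byte != code_point.encode("utf-8")
```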

One of Chrome's JSON encoders also has this confusion.

Why have both u'' and b'' strings, if only b'' is technically needed?

A few reasons:

  1. Apps in languages like Python and Rust could make use of the distinction. Oils doesn't have a string/bytes distinction (on the "interior"), but many languages do.
  2. Using u'' strings can avoid hacks like WTF-8, which is often required for round-tripping arbitrary JSON messages. Our u'' strings don't require WTF-8 because they can't represent surrogate halves.
  3. u'' strings add trivial weight to the spec, since compared to b'' strings, they simply remove \yff. This is true because encoded J8 strings must be valid UTF-8.

Why not use double quotes like u"" and b""?

J8-style strings could have used double quotes. But single quotes make the new styles more visually distinct from "", and they allow '' as a synonym for u''.

Compared to "" strings, '' strings don't have a UTF-16 legacy.

How do I write a J8 encoder and decoder?

The list of errors at ref/chap-errors.html may be a good starting point.

TODO: describe the Oils implementation.

Should a J8 number be mapped to an Int, Float, or Decimal type?

J8 Notation is like JSON: it only specifies the syntax of messages on the wire.

The mapping of text to types is left to implementers, and depends on the programming language:

OSH and YSH happen to use Int and Float, but this is logically separate from J8 Notation.
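
For instance, Python's standard json module maps the same number syntax to different types depending on how the decoder is configured. This only illustrates that the mapping is the implementer's choice; it says nothing about what Oils does.

```python
import json
from decimal import Decimal

# Default mapping: integers become int, other numbers become float
as_default = json.loads("[42, 1.5]")
assert type(as_default[0]) is int
assert type(as_default[1]) is float

# Same text, different mapping: non-integer numbers become Decimal
as_decimal = json.loads("[42, 1.5]", parse_float=Decimal)
assert as_decimal[1] == Decimal("1.5")
```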

Glossary

Formats built on J8 strings:

Generated on Sat, 03 Aug 2024 17:00:28 +0000