---
default_highlighter: oils-sh
---

J8 Notation - Fixing the JSON-Unix Mismatch
===========

J8 Notation is a set of text interchange formats. It's a syntax for:

1. **strings** / bytes
1. tree-shaped **records** (like [JSON]($xref))
1. line-based **streams** (like Unix)
1. **tables** (like TSV)

It's part of the Oils project, and is intended to solve the *JSON-Unix
Mismatch*: the Unix kernel deals with bytes, while JSON deals with Unicode
strings (plus UTF-16 errors).

It's backward compatible with [JSON]($xref), and built on top of
it.

But just like JSON isn't only for JavaScript, J8 Notation isn't only for Oils.
Any language that understands JSON should also understand J8 Notation.

(Note: J8 replaced the similar [QSN](qsn.html) design in January
2024. QSN wasn't as compatible with both JSON and YSH code.)

<div id="toc">
</div>

## Quick Picture

<style>
  .uni4 {
    /* color: #111; */
  }
  .dq {
    color: darkred;
  }
  .sq {
    color: #111;
  }
</style>

There are 3 styles of J8 strings:

<pre style="font-size: x-large;">
 <span class=dq>"</span>hi &#x1f642; \u<span class=uni4>D83D</span>\u<span class=uni4>DE42</span><span class=dq>"</span> <span class="sh-comment"># JSON-style, with surrogate pair</span>

<span class=sq>b'</span>hi &#x1f642; \yF0\y9F\y99\y82<span class=sq>'</span> <span class="sh-comment"># Can be ANY bytes, including UTF-8</span>

<span class=sq>u'</span>hi &#x1f642; \u{1F642}<span class=sq>'</span> <span class="sh-comment"># nice alternative syntax</span>
</pre>

They all denote the same decoded string &mdash; "hi" and two `U+1F642` smiley
faces:

<pre style="font-size: x-large;">
hi &#x1f642; &#x1f642;
</pre>

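
As a quick check of that claim, here is a minimal Python sketch. Python is
used only for illustration and isn't part of J8 or Oils. It rebuilds the
smiley from the data that each of the three escape forms denotes, and compares
the results.

```python
import json

smiley = "\U0001F642"

# JSON-style: a surrogate pair written as two \uXXXX escapes
from_json = json.loads('"\\uD83D\\uDE42"')

# b''-style: the four bytes denoted by \yF0\y9F\y99\y82, read as UTF-8
from_bytes = bytes([0xF0, 0x9F, 0x99, 0x82]).decode("utf-8")

# u''-style: the code point denoted by \u{1F642}
from_code_point = chr(0x1F642)

all_equal = from_json == from_bytes == from_code_point == smiley
print(all_equal)
```
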
Why did we add these `u''` and `b''` strings?

- We want to represent any string that a Unix kernel can emit (`argv` arrays,
  env variables, filenames, file contents, etc.)
  - J8 encoders emit `b''` strings to avoid losing information.
- `u''` strings are like `b''` strings, but they can only express valid
  Unicode strings.

<!-- They can't express arbitrary binary data, and there's no such thing as a
surrogate pair or half. -->

---

Now, starting with J8 strings, we define the formats JSON8:

    { name: "Alice",
      signature: b'\y01 ... \yff',  # binary data
    }

J8 Lines:

    doc/hello.md
    "doc/with spaces.md"
    b'doc/with byte \yff.md'

and TSV8:

    !tsv8   size    name
    !type   Int     Str
            42      doc/hello.md
            55      "doc/with spaces.md"
            99      b'doc/with byte \yff.md'

Together, these are called *J8 Notation*.

(JSON8 and TSV8 are still to be fully implemented in Oils.)

## Goals

1. Fix the **JSON-Unix mismatch**: all text formats should be able to express
   byte strings.
   - But it's OK to use plain JSON in Oils, e.g. when filenames are known to be
     strings.
1. Provide an option to avoid the surrogate pair / **UTF-16 legacy** of JSON.
1. Allow expressing metadata about **strings vs. bytes**.
1. Turn TSV into an **exterior** [data
   frame](https://www.oilshell.org/blog/2018/11/30.html) format.
   - Unix tools like `awk`, `cut`, and `sort` already understand tables
     informally.

<!--
   - TSV8 cells can represent arbitrary binary data, including tabs and
     newlines.
-->

Non-goals:

1. "Replace" JSON. JSON8 is backward compatible with JSON, and sometimes the
   lossy encoding is OK.
1. Resolve the strings vs. bytes dilemma in all situations.
   - Like JSON, our spec is **syntactic**. We don't specify a mapping from J8
     strings to interior data types in any particular language.

<!--
## J8 Notation in As Few Words As Possible

J8 Strings are a superset of JSON strings:

Only valid unicode:

<pre style="font-size: x-large;">
u'hi &#x1f926; \u{1f926}' &rarr; hi &#x1f926; &#x1f926;
</pre>

JSON: unicode + surrogate halves:

<pre style="font-size: x-large;">
 "hi &#x1f926; \ud83e\udd26" &rarr; hi &#x1f926; &#x1f926;
 "\ud83e"
</pre>

Any byte string:

<pre style="font-size: x-large;">
b'hi &#x1f926; \u{1f926} \yf0\y9f\ya4\ya6' &rarr; hi &#x1f926; &#x1f926; &#x1f926;
b'\yff'
</pre>

## Structured Formats

### JSON8

### TSV8

1. Required first row with column names
1. Optional second row with column types
1. Gutter Column

-->

## Reference

See the [Data Notation Table of Contents](ref/toc-data.html) in the [Oils
Reference](ref/index.html).

### TODO / Diagrams

- Diagram of Evolution
  - JSON strings &rarr; J8 Strings
  - J8 strings as a building block &rarr; JSON8 and TSV8
- Venn Diagrams of Data Language Relationships
  - If you add the left "gutter" column, every TSV is valid TSV8.
  - Every TSV8 is also syntactically valid TSV. For example, you can import it
    into a spreadsheet, and remove/ignore the gutter column and type row.
    - TODO: make a screenshot and test it
- Doc: How to turn a JSON library into a J8 Notation library.
  - Issue: an interior type that can represent byte strings.

## J8 Strings - Unicode and bytes

Let's review JSON strings, and then describe J8 strings.

### Review of JSON strings

JSON strings are enclosed in double quotes, and may have these escape
sequences:

    \"   \\   \/
    \b   \f   \n   \r   \t
    \u1234

Properties of JSON:

- The encoded form must also be valid UTF-8.
- The encoded form can't contain literal control characters, including literal
  tabs or newlines. (This is good for TSV8, because it means a literal tab is
  always a field separator.)

### J8 Description

There are 3 **styles** of J8 strings:

1. JSON strings `""`
1. `b''` strings
1. `u''` strings

`b''` strings have these escapes:

    \yff         # byte escape
    \u{1f926}    # code point escape. UTF-16 escapes like \u1234
                 # are ILLEGAL
    \'           # single quote, in addition to \"
    \" \\ \/     # same as JSON
    \b \f \n \r \t

(JSON-style double-quoted strings remain the same in J8 Notation; they do not
add the `\'` escape.)

Examples:

    b''
    b'hello'
    b'\\'
    b'"double" \'single\''
    b'nul byte \y00, unicode \u{1f642}'

`u''` strings have all the same escapes, but **not** `\yff`. This implies that
they're always valid unicode strings. (If JSON-style `\u1234` escapes were
allowed, they wouldn't be.)

Examples:

    u''
    u'hello'
    u'unicode string \u{1f642}'

A string *without* a prefix, like `'foo'`, is equivalent to `u'foo'`:

    'this is a u string'   # discouraged, unless the context is clear

    u'this is a u string'  # better to be explicit

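
To make the escape rules concrete, here is a minimal Python sketch of a decoder
for J8-style strings. The name `decode_j8` is hypothetical and is not the Oils
implementation. It returns bytes for every style, and it assumes that
`\u{...}` takes 1 to 6 hex digits and rejects surrogate code points, which this
document implies but doesn't spell out.

```python
import re

# One token: an escape sequence, or a run of ordinary characters.
_TOKEN = re.compile(
    r"""\\y([0-9a-fA-F]{2})          # byte escape, b'' strings only
      | \\u\{([0-9a-fA-F]{1,6})\}    # code point escape
      | \\(['"\\/bfnrt])             # one-character escapes
      | ([^\\']+)                    # literal text
    """,
    re.VERBOSE,
)

_SIMPLE = {
    "'": "'", '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}


def decode_j8(encoded: str) -> bytes:
    """Decode a b'' / u'' / '' string to bytes (hypothetical helper)."""
    m = re.fullmatch(r"(b|u|)'(.*)'", encoded, re.DOTALL)
    if m is None:
        raise ValueError("expected a b'', u'' or '' string")
    prefix, body = m.group(1), m.group(2)

    out = bytearray()
    pos = 0
    while pos < len(body):
        m = _TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"bad escape or unescaped quote at offset {pos}")
        byte_hex, code_hex, simple, literal = m.groups()
        if byte_hex is not None:
            if prefix != "b":
                raise ValueError(r"\yHH is only allowed in b'' strings")
            out.append(int(byte_hex, 16))
        elif code_hex is not None:
            cp = int(code_hex, 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("not a valid Unicode code point")
            out += chr(cp).encode("utf-8")
        elif simple is not None:
            out += _SIMPLE[simple].encode("utf-8")
        else:
            if any(ord(ch) < 0x20 for ch in literal):
                raise ValueError("literal control character")
            out += literal.encode("utf-8")
        pos = m.end()
    return bytes(out)


print(decode_j8("b'nul byte \\y00, unicode \\u{1f642}'"))
```

Note that a JSON-style escape like `\u1234` matches none of the token
patterns, so the sketch rejects it, as the escape table above requires.
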
### What's representable by each style?

<style>
#subset {
  text-align: center;
  background-color: #DEE;
  padding-top: 0.5em; padding-bottom: 0.5em;
  margin-left: 3em; margin-right: 3em;
}
.set {
  font-size: x-large;
}
</style>

These relationships might help you understand the 3 styles of strings:

<div id="subset">

<span class="set">Strings representable by `u''`</span><br/>
&equals; All Unicode Strings (no more and no less)

<b>&subset;</b>

<span class="set">Strings representable by `""`</span> (JSON-style)<br/>
&equals; All Unicode Strings <b>&cup;</b> Surrogate Half Errors

<b>&subset;</b>

<span class="set">Strings representable by `b''`</span><br/>
&equals; All Byte Strings

</div>

Examples:

- The JSON message `"\udd26"` represents a string that's not Unicode &mdash; it
  has a surrogate half error. This string is **not** representable with `u''`
  strings.
- The J8 message `b'\yff'` represents a byte string. This string is **not**
  representable with JSON strings or `u''` strings.

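
Both examples can be reproduced with Python's standard library. This is only an
illustration of the two error cases, not a J8 implementation: Python's `json`
module happens to keep a lone surrogate when decoding, and its strict UTF-8
codec rejects both the lone surrogate and the byte `0xff`.

```python
import json

# "\udd26" is a valid JSON message, but it decodes to a lone surrogate,
# which has no UTF-8 encoding.
lone = json.loads('"\\udd26"')
try:
    lone.encode("utf-8")
    json_example_is_unicode = True
except UnicodeEncodeError:
    json_example_is_unicode = False

# The single byte 0xff (written b'\yff' in J8) is not valid UTF-8.
try:
    bytes([0xFF]).decode("utf-8")
    byte_example_is_unicode = True
except UnicodeDecodeError:
    byte_example_is_unicode = False

print(json_example_is_unicode, byte_example_is_unicode)
```
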
### Asymmetry of Encoders and Decoders

A few things to notice about J8 **encoders**:

1. They *may* emit only `""` strings, possibly using the Unicode replacement
   char `U+FFFD`. An encoder that does this is a strict JSON encoder.
1. To preserve all information, they *must* emit `b''` strings for data that
   isn't valid Unicode, because `U+FFFD` replacement is lossy.
1. They *never* need to emit `u''` strings.
   - This is because `""` strings (and `b''` strings) can represent all values
     that `u''` strings can. Still, `u''` strings may be desirable in some
     situations, like when you want `\u{1f642}` escapes, or to assert that a
     value must be a valid Unicode string.

On the other hand, J8 **decoders** must accept all 3 kinds of strings.

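
One possible encoder policy that follows from these points is sketched below
in Python: emit a JSON-style string when the data is valid UTF-8, and fall back
to a `b''` string otherwise. The function names are hypothetical, and the
choice to write control characters as `\yHH` inside `b''` strings is an
assumption made to keep the sketch short.

```python
import json


def encode_b_string(data: bytes) -> str:
    """Encode arbitrary bytes as a b'' string (hypothetical helper)."""
    parts = ["b'"]
    # surrogateescape maps each undecodable byte 0xNN to U+DC00 + 0xNN,
    # so valid UTF-8 runs stay readable and bad bytes are easy to spot.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            parts.append(f"\\y{code - 0xDC00:02x}")
        elif ch in "\\'":
            parts.append("\\" + ch)
        elif code < 0x20:
            parts.append(f"\\y{code:02x}")
        else:
            parts.append(ch)
    parts.append("'")
    return "".join(parts)


def encode_j8(data: bytes) -> str:
    """Emit "" when that is lossless, and b'' otherwise."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return encode_b_string(data)
    return json.dumps(text, ensure_ascii=False)


print(encode_j8(b"doc/hello.md"))
print(encode_j8(b"doc/with byte \xff.md"))
```

This sketch never emits `u''` strings, which matches the third point above.
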
### YSH has 2 of the 3 styles

A nice property of YSH is that the `u''` and `b''` strings are valid code:

    echo u'hi \u{1f642}'  # u respected in YSH, but not OSH

    var myBytes = b'\yff\yfe'

This is useful for correct code generation, and simplifies the language.

But JSON-style strings aren't valid in YSH. The two usages of double quotes
can't really be reconciled, because JSON looks like `"line\n"` and shell looks
like `"x = ${myvar}"`.

### J8 Strings vs. POSIX Shell Strings

When the encoded form of a J8 string doesn't contain a **backslash**, it's
identical to a POSIX shell string.

In this case, it can make sense to omit the `u''` prefix. Example:

<pre>
shell_string='hi &#x1f642;'

var ysh_str = u'hi &#x1f642;'

var ysh_str = 'hi &#x1f642;' <span class="sh-comment"># same thing</span>
</pre>

An encoded J8 string has no backslashes when the original string has all these
properties:

1. Valid Unicode (no non-UTF-8 bytes).
1. No ASCII control characters. All bytes are `0x20` and greater.
1. No backslashes or single quotes. (All other required escapes are control
   characters.)


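
The three properties translate directly into a check. The Python sketch below
is illustrative only, and the function name `is_plain_shell_string` is
hypothetical.

```python
def is_plain_shell_string(data: bytes) -> bool:
    """True if the '' encoding of data would contain no backslashes."""
    try:
        text = data.decode("utf-8")      # property 1: valid Unicode
    except UnicodeDecodeError:
        return False
    for ch in text:
        if ord(ch) < 0x20:               # property 2: no ASCII control chars
            return False
        if ch in "\\'":                  # property 3: no backslash or quote
            return False
    return True


print(is_plain_shell_string("hi \U0001F642".encode("utf-8")))
print(is_plain_shell_string(b"it's"))
```
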
## JSON8 - Tree-Shaped Records

Now that we've defined J8 strings, we can define JSON8, an obvious extension of
JSON.

(Not implemented yet.)

### Review of JSON

See <https://json.org>

    [primitive]   null   true   false
    [number]      42  -1.2e-4
    [string]      "hello\n"
    [array]       [1, 2, 3]
    [object]      {"key": 42}

### JSON8 Description

JSON8 is like JSON, but:

1. All strings can be J8 strings &mdash; one of the **3 styles** described
   above.
1. Object/Dict keys may be **unquoted**, like `{age: 42}`
   - Unquoted keys must be a valid JS identifier name matching the pattern
     `[a-zA-Z_][a-zA-Z0-9_]*`.
1. **Trailing commas** are allowed on objects and arrays: `{"d": 42,}` and `[42,]`
1. End-of-line comments are allowed. We use `#` to be consistent with shell.

<!--
Note that // is consistent with JavaScript / JSON5, but it actually conflicts
with Scheme symbols, which we use for NIL8. These are both valid Scheme, and
probably NIL8:

    (/ 5 3)
    (// 5 3)  # This should not start a comment!
-->

Example:

```
{ name: "Bob",  # comment
  age: 30,
  sig: b'\y00\y01 ... \yff',  # trailing comma, binary data
}
```

<!--
!json8  # optional prefix to distinguish from JSON

I think using unquoted keys is a good enough signal, or MIME type.

-->

## J8 Lines - Lines of Text

*J8 Lines* is another format built on J8 strings. Each line is either:

1. An unquoted string, which must be valid UTF-8. Whitespace is allowed, but
   not other ASCII control chars.
2. A quoted J8 string (JSON style `""` or J8-style `b'' u''`)
3. An **ignored** empty line

In all cases, leading and trailing whitespace is ignored.

---

For example, 6 strings with weird characters could be represented like this:

    dir/with spaces.txt         # unquoted string must be UTF-8
    "dir/with newline \n.txt"   # JSON-style
    b'dir/with bytes \yff.txt'  # J8-style
    u'dir/unicode \u{3bc}'
                                # ignored empty line
    ''                          # empty string, not ignored
    'dir/unicode \u{3bc}'       # no prefix implies u''

Note that J8 strings always occupy **one** physical line, because they can't
contain unescaped control characters, including newlines.

*J8 Lines* can be viewed as a simpler case of TSV8, described in the next
section.

<!--

TODO: show grammar, which disallows anything but significant tabs/newlines, and
insignificant spaces
-->

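
The three kinds of line can be told apart by their first characters. Here is a
minimal Python sketch, with the hypothetical name `read_j8_line`. It decodes
JSON-style lines with the standard `json` module and leaves `b''` / `u''`
lines in encoded form, since decoding them needs a J8 string decoder. It
assumes "whitespace" means spaces, tabs, and line endings, and it does not
handle the `#` annotations shown in the example above.

```python
import json


def read_j8_line(line: str):
    """Return (style, value) for one line, or None for an ignored empty line."""
    text = line.strip(" \t\r\n")
    if not text:
        return None
    if text.startswith('"'):
        return ("json", json.loads(text))
    if text.startswith(("b'", "u'", "'")):
        return ("j8", text)  # still encoded
    return ("unquoted", text)


lines = [
    "dir/with spaces.txt",
    '"dir/with newline \\n.txt"',
    "b'dir/with bytes \\yff.txt'",
    "",
    "''",
]
values = [v for v in map(read_j8_line, lines) if v is not None]
print(len(values))
```
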
#### Related

- <https://jsonlines.org/> allows not just strings, but any value like `{}` and
  `[]`. We could define an obvious "JSON8 Lines" format, which is different
  than "J8 Lines".

## TSV8 - Table-Shaped Text

Let's review TSV, and then describe TSV8.

### Review of TSV

TSV has a very short specification:

- <https://www.iana.org/assignments/media-types/text/tab-separated-values>

Example:

```
name<TAB>age
alice<TAB>44
bob<TAB>33
```

Limitations:

- Fields can't contain tabs or newlines.
- There's no escaping, so unprintable bytes in field values result in an
  unprintable TSV file.
- Spaces are easy to confuse with tabs.

### TSV8 Description

TSV8 is like TSV with:

1. A `!tsv8` prefix and required column names.
2. An optional `!type` line, with types `Bool Int Float Str`.
3. Other optional column attributes.
4. Rows of data, each starting with an empty "gutter" column.

Example:

```
!tsv8   age     name
!type   Int     Str                   # optional types
!other  x       y                     # more column metadata
        44      alice
        33      bob
        1       "a\tb"
        2       b'nul \y00'
        3       u'unicode \u{3bc}'
```

Types:

```
[Bool]      false   true
[Int]       JSON numbers, restricted to [0-9]+
[Float]     same as JSON
[Str]       J8 string (any of the 3 styles)
```

Rules for cells:

1. They can be any of 4 forms in J8 Lines:
   1. Unquoted
   1. JSON-style `""`
   1. `u''`
   1. `b''`
1. Leading and trailing whitespace must be stripped, as in J8 Lines.

TODO: What about empty cells? Are they equivalent to `null`? TSV apparently
can't have empty cells, as the rule is `[character]+`, not `[character]*`.

Column attributes:

- `!format` could be Instant / Duration?

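
Since TSV8 isn't fully implemented yet, the following Python sketch is only one
reading of the description above, with hypothetical names throughout. It
assumes columns are separated by literal tabs, ignores column attributes other
than `!type`, leaves `Str` cells in encoded form, and does not handle the `#`
annotations shown in the example.

```python
import re


def convert_cell(type_name: str, cell: str):
    """Convert one stripped cell according to its declared column type."""
    if type_name == "Bool":
        if cell not in ("true", "false"):
            raise ValueError(f"bad Bool: {cell!r}")
        return cell == "true"
    if type_name == "Int":
        if not re.fullmatch(r"[0-9]+", cell):
            raise ValueError(f"bad Int: {cell!r}")
        return int(cell)
    if type_name == "Float":
        return float(cell)  # more lenient than JSON; fine for a sketch
    return cell             # Str: left as an encoded J8 string


def parse_tsv8(text: str):
    """Parse TSV8 text into (names, types, rows)."""
    lines = [ln for ln in text.split("\n") if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    first = lines[0].split("\t")
    if first[0] != "!tsv8":
        raise ValueError("missing !tsv8 prefix")
    names = [c.strip() for c in first[1:]]

    types = None
    rows = []
    for ln in lines[1:]:
        gutter, *cells = ln.split("\t")
        cells = [c.strip() for c in cells]
        if len(cells) != len(names):
            raise ValueError(f"expected {len(names)} cells: {ln!r}")
        if gutter == "!type":
            types = cells
        elif gutter.startswith("!"):
            pass  # other column attributes are ignored in this sketch
        elif gutter == "":
            rows.append(cells)
        else:
            raise ValueError(f"data rows need an empty gutter: {ln!r}")

    if types is not None:
        rows = [[convert_cell(t, c) for t, c in zip(types, row)]
                for row in rows]
    return names, types, rows


doc = "!tsv8\tage\tname\n!type\tInt\tStr\n\t44\talice\n\t33\tbob\n"
print(parse_tsv8(doc))
```
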
### Design Notes

TODO: This section will be filled in as we implement TSV8.

- Null Issues:
  - Are bools nullable? There seems to be no reason, but a value could be
    missing.
  - Are ints nullable? In SQL they probably are.
  - Are floats nullable? Yes, like NA in R.
  - Decoders can use a parallel typed column to indicate nulls?

- It's OK to use plain TSV in YSH programs as well. You don't have to add
  types if you don't want to.


## Summary

This document described an upgrade of JSON strings:

- J8 Strings (in 3 styles)

And data formats that are built on top of these strings:

- JSON8 - tree-shaped records
- J8 Lines - Unix streams
- TSV8 - table-shaped data

## Appendix

### Related Links

- <https://json.org/>
- JSON extensions
  - <https://json5.org/>
  - [JSON with Commas and
    Comments](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
  - Survey: <https://github.com/json-next/awesome-json-next>

### Future Work

We could have an SEXP8 format for:

- Concrete syntax trees, with location information
- Textual IRs like WebAssembly

## FAQ

### Why are byte escapes spelled `\yff`, and not `\xff` as in C?

Because in JavaScript and Python, `\xff` is a **code point**, not a byte. That
is, it's a synonym for `\u00ff`, which is encoded in UTF-8 as the 2 bytes
`0xc3 0xbf`.

This is **exactly** the confusion we want to avoid, so `\yff` is explicitly
different.

One of Chrome's JSON encoders [also has this
confusion](https://source.chromium.org/chromium/chromium/src/+/main:base/json/json_reader.h;l=27;drc=d0919138b7951c1a154cf802a68aad7904b6f4c9).

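
The difference is easy to see in Python itself. This short sketch only
illustrates the point about `\xff` in a Python string literal; the last line
shows the single byte that `b'\yff'` denotes in J8.

```python
code_point = "\xff"                 # a str literal: the code point U+00FF
same_thing = "\u00ff"               # the same one-character string
utf8 = code_point.encode("utf-8")   # two bytes: 0xc3 0xbf
single_byte = bytes([0xFF])         # what b'\yff' denotes: one byte

print(code_point == same_thing, utf8.hex(), single_byte.hex())
```
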
### Why have both `u''` and `b''` strings, if only `b''` is technically needed?

A few reasons:

1. Apps in languages like Python and Rust could make use of the distinction.
   Oils doesn't have a string/bytes distinction (on the "interior"), but many
   languages do.
1. Using `u''` strings can avoid hacks like
   [WTF-8](http://simonsapin.github.io/wtf-8/), which is often required for
   round-tripping arbitrary JSON messages. Our `u''` strings don't require
   WTF-8 because they can't represent surrogate halves.
1. `u''` strings add trivial weight to the spec, since compared to `b''`
   strings, they simply remove `\yff`. This is true because *encoded* J8 strings
   must be valid UTF-8.

### Why not use double quotes like `u""` and `b""`?

J8-style strings could have used double quotes. But single quotes make the new
styles more visually distinct from `""`, and it allows `''` as a synonym for
`u''`.

Compared to `""` strings, `''` strings don't have a UTF-16 legacy.

### How do I write a J8 encoder and decoder?

The list of errors at [ref/chap-errors.html](ref/chap-errors.html) may be a
good starting point.

TODO: describe the Oils implementation.

## Glossary

- **J8 Strings** - the building block for JSON8 and TSV8. There are 3 similar
  syntaxes: `"foo"` and `b'foo'` and `u'foo'`.
- **JSON strings** - double quoted strings `"foo"`.
- **J8-style strings** - either `b'foo'` or `u'foo'`.

Formats built on J8 strings:

- **J8 Lines** - unquoted and J8 strings, one per line.
- **JSON8** - An upgrade of JSON.
- **TSV8** - An upgrade of TSV.
