---
default_highlighter: oils-sh
---

J8 Notation - Fixing the JSON-Unix Mismatch
===========

J8 Notation is a set of text interchange formats. It's a syntax for:

1. **strings** / bytes
1. tree-shaped **records** (like [JSON]($xref))
1. line-based **streams** (like Unix)
1. **tables** (like TSV)

It's part of the Oils project, and is intended to solve the *JSON-Unix
Mismatch*: the Unix kernel deals with bytes, while JSON deals with Unicode
strings (plus UTF-16 errors).

It's backward compatible with [JSON]($xref), and built on top of
it.

But just like JSON isn't only for JavaScript, J8 Notation isn't only for Oils.
Any language that understands JSON should also understand J8 Notation.

(Note: J8 replaced the similar [QSN](qsn.html) design in January 2024. QSN
wasn't as compatible with both JSON and YSH code.)

<div id="toc">
</div>

## Quick Picture

<style>
  .uni4 {
    /* color: #111; */
  }
  .dq {
    color: darkred;
  }
  .sq {
    color: #111;
  }
</style>

There are 3 styles of J8 strings:

<pre style="font-size: x-large;">
 <span class=dq>"</span>hi &#x1f642; \u<span class=uni4>D83D</span>\u<span class=uni4>DE42</span><span class=dq>"</span>      <span class="sh-comment"># JSON-style, with surrogate pair</span>

<span class=sq>b'</span>hi &#x1f642; \yF0\y9F\y99\y82<span class=sq>'</span>        <span class="sh-comment"># Can be ANY bytes, including UTF-8</span>

<span class=sq>u'</span>hi &#x1f642; \u{1F642}<span class=sq>'</span>               <span class="sh-comment"># nice alternative syntax</span>
</pre>

They all denote the same decoded string &mdash; "hi" and two `U+1F642` smiley
faces:

<pre style="font-size: x-large;">
hi &#x1f642; &#x1f642;
</pre>
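
The equivalence of the three spellings can be checked with a few lines of
Python. This is only a sketch: it doesn't use a J8 library, and it writes the
`b''` and `u''` payloads with Python's own escapes (`\xNN` in a bytes literal,
`\U0001F642` in a str literal) as stand-ins for `\yNN` and `\u{1F642}`.

```python
import json

# JSON-style: the decoder combines the surrogate pair \uD83D\uDE42.
from_json = json.loads('"hi \\uD83D\\uDE42"')

# b''-style: the four byte escapes are the UTF-8 encoding of U+1F642.
from_bytes = b"hi \xF0\x9F\x99\x82".decode("utf-8")

# u''-style: a single code point escape.
from_code_point = "hi \U0001F642"

assert from_json == from_bytes == from_code_point
```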

Why did we add these `u''` and `b''` strings?

- We want to represent any string that a Unix kernel can emit (`argv` arrays,
  env variables, filenames, file contents, etc.)
  - J8 encoders emit `b''` strings to avoid losing information.
- `u''` strings are like `b''` strings, but they can only express valid
  Unicode strings.

<!-- They can't express arbitrary binary data, and there's no such thing as a
surrogate pair or half. -->

---

Now, starting with J8 strings, we define the formats JSON8:

    { name: "Alice",
      signature: b'\y01 ... \yff',  # binary data
    }

J8 Lines:

      doc/hello.md
     "doc/with spaces.md"
    b'doc/with byte \yff.md'

and TSV8:

    !tsv8   size    name
    !type   Int     Str
            42      doc/hello.md
            55      "doc/with spaces.md"
            99      b'doc/with byte \yff.md'

Together, these are called *J8 Notation*.

(JSON8 and TSV8 are still to be fully implemented in Oils.)

## Goals

1. Fix the **JSON-Unix mismatch**: all text formats should be able to express
   byte strings.
   - But it's OK to use plain JSON in Oils, e.g. when filenames are known to be
     strings.
1. Provide an option to avoid the surrogate pair / **UTF-16 legacy** of JSON.
1. Allow expressing metadata about **strings vs. bytes**.
1. Turn TSV into an **exterior** [data
   frame](https://www.oilshell.org/blog/2018/11/30.html) format.
   - Unix tools like `awk`, `cut`, and `sort` already understand tables
     informally.

<!--
   - TSV8 cells can represent arbitrary binary data, including tabs and
     newlines.
-->

Non-goals:

1. "Replace" JSON. JSON8 is backward compatible with JSON, and sometimes the
   lossy encoding is OK.
1. Resolve the strings vs. bytes dilemma in all situations.
   - Like JSON, our spec is **syntactic**. We don't specify a mapping from J8
     strings to interior data types in any particular language.

<!--
## J8 Notation in As Few Words As Possible

J8 Strings are a superset of JSON strings:

Only valid unicode:

<pre style="font-size: x-large;">
u'hi &#x1f926; \u{1f926}'      &rarr;  hi &#x1f926; &#x1f926;
</pre>

JSON: unicode + surrogate halves:

<pre style="font-size: x-large;">
 "hi &#x1f926; \ud83e\udd26"   &rarr;  hi &#x1f926; &#x1f926;
 "\ud83e"
</pre>

Any byte string:

<pre style="font-size: x-large;">
b'hi &#x1f926; \u{1f926} \yf0\y9f\ya4\ya6'   &rarr;  hi &#x1f926; &#x1f926; &#x1f926;
b'\yff'
</pre>

## Structured Formats

### JSON8

### TSV8

1. Required first row with column names
1. Optional second row with column types
1. Gutter Column

-->

## Reference

See the [Data Notation Table of Contents](ref/toc-data.html) in the [Oils
Reference](ref/index.html).

### TODO / Diagrams

- Diagram of Evolution
  - JSON strings &rarr; J8 Strings
  - J8 strings as a building block &rarr; JSON8 and TSV8
- Venn Diagrams of Data Language Relationships
  - If you add the left "gutter" column, every TSV is valid TSV8.
  - Every TSV8 is also syntactically valid TSV. For example, you can import it
    into a spreadsheet, and remove/ignore the gutter column and type row.
    - TODO: make a screenshot and test it
- Doc: How to turn a JSON library into a J8 Notation library.
  - Issue: an interior type that can represent byte strings.

## J8 Strings - Unicode and bytes

Let's review JSON strings, and then describe J8 strings.

### Review of JSON strings

JSON strings are enclosed in double quotes, and may have these escape
sequences:

    \"   \\   \/
    \b   \f   \n   \r   \t
    \u1234

Properties of JSON:

- The encoded form must also be valid UTF-8.
- The encoded form can't contain literal control characters, including literal
  tabs or newlines. (This is good for TSV8, because it means a literal tab is
  always a field separator.)

### J8 Description

There are 3 **styles** of J8 strings:

1. JSON strings `""`
1. `b''` strings
1. `u''` strings

`b''` strings have these escapes:

    \yff        # byte escape
    \u{1f926}   # code point escape. UTF-16 escapes like \u1234
                # are ILLEGAL
    \'          # single quote, in addition to \"
    \"  \\  \/  # same as JSON
    \b  \f  \n  \r  \t

(JSON-style double-quoted strings remain the same in J8 Notation; they do not
add the `\'` escape.)

Examples:

    b''
    b'hello'
    b'\\'
    b'"double" \'single\''
    b'nul byte \y00, unicode \u{1f642}'

`u''` strings have all the same escapes, but **not** `\yff`. This implies that
they're always valid unicode strings. (If JSON-style `\u1234` escapes were
allowed, they wouldn't be.)

Examples:

    u''
    u'hello'
    u'unicode string \u{1f642}'

A string *without* a prefix, like `'foo'`, is equivalent to `u'foo'`:

     'this is a u string'  # discouraged, unless the context is clear

    u'this is a u string'  # better to be explicit
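
To make the escape rules concrete, here is a minimal decoder sketch in Python
for the two J8-style strings. It is not the Oils implementation: the name
`decode_j8` is hypothetical, the result is returned as raw bytes for all
styles, and rejecting surrogate code points inside `\u{...}` is this sketch's
reading of "always valid unicode strings".

```python
import re

# One-character escapes shared by b'' and u'' strings.
_SIMPLE = {
    "'": "'", '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}

# \yff, \u{1f926}, or a one-character escape. JSON-style \u1234 does not match.
_ESCAPE = re.compile(
    r"\\(?:y([0-9a-fA-F]{2})|u\{([0-9a-fA-F]{1,6})\}|(['\"\\/bfnrt]))"
)


def decode_j8(encoded: str) -> bytes:
    """Decode b'...', u'...' or '...' to bytes (hypothetical helper)."""
    m = re.fullmatch(r"(b|u|)'(.*)'", encoded, re.DOTALL)
    if m is None:
        raise ValueError("expected b'...', u'...' or '...'")
    byte_escapes_ok = m.group(1) == "b"
    body = m.group(2)

    out = bytearray()
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch == "\\":
            esc = _ESCAPE.match(body, pos)
            if esc is None:
                raise ValueError(f"bad escape at offset {pos}")
            byte_hex, code_hex, simple = esc.groups()
            if byte_hex is not None:
                if not byte_escapes_ok:
                    raise ValueError("\\y escapes are only legal in b'' strings")
                out.append(int(byte_hex, 16))
            elif code_hex is not None:
                code = int(code_hex, 16)
                if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                    raise ValueError("not a Unicode scalar value")
                out += chr(code).encode("utf-8")
            else:
                out += _SIMPLE[simple].encode("utf-8")
            pos = esc.end()
        elif ch == "'" or ord(ch) < 0x20:
            # Unescaped quotes and literal control characters are rejected.
            raise ValueError(f"unescaped character at offset {pos}")
        else:
            out += ch.encode("utf-8")
            pos += 1
    return bytes(out)
```

A caller that wants a string/bytes distinction could decode the result of a
`u''` string as UTF-8 and keep the result of a `b''` string as bytes.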

### What's representable by each style?

<style>
#subset {
  text-align: center;
  background-color: #DEE;
  padding-top: 0.5em; padding-bottom: 0.5em;
  margin-left: 3em; margin-right: 3em;
}
.set {
  font-size: x-large;
}
</style>

These relationships might help you understand the 3 styles of strings:

<div id="subset">

<span class="set">Strings representable by `u''`</span><br/>
&equals; All Unicode Strings (no more and no less)

<b>&subset;</b>

<span class="set">Strings representable by `""`</span> (JSON-style)<br/>
&equals; All Unicode Strings <b>&cup;</b> Surrogate Half Errors

<b>&subset;</b>

<span class="set">Strings representable by `b''`</span><br/>
&equals; All Byte Strings

</div>

Examples:

- The JSON message `"\udd26"` represents a string that's not Unicode &mdash; it
  has a surrogate half error. This string is **not** representable with `u''`
  strings.
- The J8 message `b'\yff'` represents a byte string. This string is **not**
  representable with JSON strings or `u''` strings.
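
Both examples can be observed from Python, whose `json` module accepts lone
surrogates. This is only an illustration of the two boundary cases, with
Python's `str` and `bytes` standing in for the "interior" types:

```python
import json

# "\udd26" decodes to a lone surrogate: a Python str, but not valid Unicode
# text, so it cannot be encoded as UTF-8.
half = json.loads('"\\udd26"')
try:
    half.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# The byte 0xff never occurs in valid UTF-8, so it is not a Unicode string.
try:
    b"\xff".decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

assert not encodable and not decodable
```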

### Asymmetry of Encoders and Decoders

A few things to notice about J8 **encoders**:

1. They *may* emit only `""` strings, substituting the Unicode replacement char
   `U+FFFD` for bytes that aren't valid UTF-8. Such an encoder is a strict JSON
   encoder.
1. They *must* emit `b''` strings to preserve all information, because `U+FFFD`
   replacement is lossy.
1. They *never* need to emit `u''` strings.
   - This is because `""` strings (and `b''` strings) can represent all values
     that `u''` strings can. Still, `u''` strings may be desirable in some
     situations, like when you want `\u{1f642}` escapes, or to assert that a
     value must be a valid Unicode string.

On the other hand, J8 **decoders** must accept all 3 kinds of strings.
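
The two encoder strategies can be sketched in Python as follows. The function
names are hypothetical, and this is not how Oils implements it. As a
simplification, the `b''` path escapes every non-ASCII byte with `\yNN`, which
is lossless but more verbose than emitting valid UTF-8 sequences literally.

```python
import json

# Bytes that have a short escape in b'' strings.
_SHORT = {
    0x5C: "\\\\", 0x27: "\\'", 0x08: "\\b", 0x0C: "\\f",
    0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t",
}


def _encode_b(data: bytes) -> str:
    parts = []
    for byte in data:
        if byte in _SHORT:
            parts.append(_SHORT[byte])
        elif 0x20 <= byte < 0x7F:
            parts.append(chr(byte))          # printable ASCII stays literal
        else:
            parts.append("\\y%02x" % byte)   # everything else is a byte escape
    return "b'" + "".join(parts) + "'"


def encode_j8(data: bytes) -> str:
    """Lossless: JSON-style when the data is valid UTF-8, else a b'' string."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return _encode_b(data)
    return json.dumps(text, ensure_ascii=False)


def encode_json_lossy(data: bytes) -> str:
    """Strict JSON: invalid bytes become U+FFFD, so information is lost."""
    return json.dumps(data.decode("utf-8", errors="replace"), ensure_ascii=False)
```

Note that neither function ever needs to produce a `u''` string, which matches
the third point in the list above.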

### YSH has 2 of the 3 styles

A nice property of YSH is that the `u''` and `b''` strings are valid code:

    echo u'hi \u{1f642}'  # u respected in YSH, but not OSH

    var myBytes = b'\yff\yfe'

This is useful for correct code generation, and simplifies the language.

But JSON-style strings aren't valid in YSH. The two usages of double quotes
can't really be reconciled, because JSON looks like `"line\n"` and shell looks
like `"x = ${myvar}"`.

### J8 Strings vs. POSIX Shell Strings

When the encoded form of a J8 string doesn't contain a **backslash**, it's
identical to a POSIX shell string.

In this case, it can make sense to omit the `u''` prefix. Example:

<pre>
shell_string='hi &#x1f642;'

var ysh_str = u'hi &#x1f642;'

var ysh_str =  'hi &#x1f642;'  <span class="sh-comment"># same thing</span>
</pre>

An encoded J8 string has no backslashes when the original string has all these
properties:

1. Valid Unicode (no non-UTF-8 bytes).
1. No ASCII control characters. All bytes are `0x20` and greater.
1. No backslashes or single quotes. (All other required escapes are control
   characters.)
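
The three properties translate directly into a predicate. A minimal Python
sketch, where the function name is hypothetical:

```python
def needs_no_backslash(data: bytes) -> bool:
    """True if `data` can be written as '...' with no escapes at all."""
    try:
        data.decode("utf-8")             # 1. valid Unicode
    except UnicodeDecodeError:
        return False
    return all(
        byte >= 0x20                     # 2. no ASCII control characters
        and byte not in (0x5C, 0x27)     # 3. no backslash or single quote
        for byte in data
    )
```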

## JSON8 - Tree-Shaped Records

Now that we've defined J8 strings, we can define JSON8, an obvious extension of
JSON.

(Not implemented yet.)

### Review of JSON

See <https://json.org>

    [primitive]   null   true   false
    [number]      42  -1.2e-4
    [string]      "hello\n"
    [array]       [1, 2, 3]
    [object]      {"key": 42}

### JSON8 Description

JSON8 is like JSON, but:

1. All strings can be J8 strings &mdash; one of the **3 styles** described
   above.
1. Object/Dict keys may be **unquoted**, like `{age: 42}`
   - Unquoted keys must be a valid JS identifier name matching the pattern
     `[a-zA-Z_][a-zA-Z0-9_]*`.
1. **Trailing commas** are allowed on objects and arrays: `{"d": 42,}` and `[42,]`
1. End-of-line comments. We use `#` to be consistent with shell.
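
For instance, an encoder could use the pattern from the list above to decide
whether a key may be left unquoted. A small Python sketch, with a hypothetical
function name:

```python
import re

# The pattern for unquoted keys, as given in the JSON8 description.
_UNQUOTED_KEY = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def can_be_unquoted(key: str) -> bool:
    return _UNQUOTED_KEY.fullmatch(key) is not None
```

With this check, `age` and `_x1` could be emitted bare, while `first name` and
`2nd` would still need quotes.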

<!--
Note that // is consistent with JavaScript / JSON5, but it actually conflicts
with Scheme symbols, which we use for NIL8. These are both valid Scheme, and
probably NIL8:

  (/ 5 3)
  (// 5 3)  # This should not start a comment!
-->

Example:

```
{ name: "Bob",  # comment
  age: 30,
  sig: b'\y00\y01 ... \yff',  # trailing comma, binary data
}
```

<!--
!json8  # optional prefix to distinguish from JSON

I think using unquoted keys is a good enough signal, or MIME type.

-->

## J8 Lines - Lines of Text

*J8 Lines* is another format built on J8 strings.

For example, to represent 4 filenames, simply use 4 lines:

      dir/my-filename.txt       # unquoted: JS name characters plus . - /
     "dir/with spaces.txt"      # JSON-style
    b'dir/with bytes \yff.txt'  # J8-style
    u'dir/unicode \u{3bc}'

Literal control characters like newlines are illegal in J8 strings, which means
that they always occupy **one** physical line.

- Leading spaces on each line are ignored, which allows aligning the quotes.
- Trailing spaces are also ignored, to aid readability. That is, significant
  spaces must appear in quotes.
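
As a rough illustration of these rules, a reader could strip the insignificant
spaces and then dispatch on the first characters of each line. The helper below
is hypothetical; it only classifies a line and leaves the string encoded, and
it doesn't handle the `#` annotations shown in the example above.

```python
def classify_j8_line(line: str) -> tuple[str, str]:
    """Return (style, encoded_string) for one physical line."""
    encoded = line.strip(" ")  # leading and trailing spaces are insignificant
    if encoded.startswith('"'):
        return "json", encoded
    if encoded.startswith("b'"):
        return "b", encoded
    if encoded.startswith("u'") or encoded.startswith("'"):
        return "u", encoded
    return "unquoted", encoded
```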

*J8 Lines* can be viewed as a degenerate case of TSV8, described in the next
section.

<!--

TODO: show grammar, which disallows anything but significant tabs/newlines, and
insignificant spaces)
-->

### Related

- <https://jsonlines.org/> - this allows not just strings, any value like `{}`
  and `[]`

## TSV8 - Table-Shaped Text

Let's review TSV, and then describe TSV8.

### Review of TSV

TSV has a very short specification:

- <https://www.iana.org/assignments/media-types/text/tab-separated-values>

Example:

```
name<TAB>age
alice<TAB>44
bob<TAB>33
```

Limitations:

- Fields can't contain tabs or newlines.
- There's no escaping, so unprintable bytes in field values result in an
  unprintable TSV file.
- Spaces are easy to confuse with tabs.

### TSV8 Description

TSV8 is like TSV with:

1. A `!tsv8` prefix and required column names.
2. An optional `!type` line, with types `Bool Int Float Str`.
3. Other optional column attributes.
4. Rows of data, each starting with an empty "gutter" column.

Example:

```
!tsv8   age     name
!type   Int     Str     # optional types
!other  x       y       # more column metadata
        44      alice
        33      bob
         1      "a\tb"
         2      b'nul \y00'
         3      u'unicode \u{3bc}'
```

Types:

```
[Bool]      false   true
[Int]       JSON numbers, restricted to [0-9]+
[Float]     same as JSON
[Str]       J8 string (any of the 3 styles)
```

Rules for cells:

1. They can be any of 4 forms in J8 Lines:
   1. Unquoted
   1. JSON-style `""`
   1. `u''`
   1. `b''`
1. Leading and trailing whitespace must be stripped, as in J8 Lines.
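
The overall shape of a TSV8 reader might look like the Python sketch below.
Since TSV8 is not implemented yet, every detail here is an assumption: the
function name is hypothetical, fields are split on literal tabs, blank lines
are skipped, unknown `!` attribute lines are ignored, and cells are returned in
their encoded form rather than decoded or type-checked.

```python
def parse_tsv8(text: str):
    """Split TSV8 text into (names, types, rows); cells stay encoded."""
    names, types, rows = None, None, []
    for line in text.split("\n"):
        if not line.strip(" \t"):
            continue  # skip blank lines (an assumption of this sketch)
        # A literal tab is always a field separator; spaces around cells
        # are insignificant.
        gutter, *cells = [cell.strip(" ") for cell in line.split("\t")]
        if gutter == "!tsv8":
            names = cells
        elif gutter == "!type":
            types = cells
        elif gutter.startswith("!"):
            pass  # other column attributes are ignored here
        elif gutter == "":
            rows.append(cells)  # data rows start with an empty gutter column
        else:
            raise ValueError(f"unexpected gutter value: {gutter!r}")
    if names is None:
        raise ValueError("missing !tsv8 line with column names")
    return names, types, rows


sample = "!tsv8\tage\tname\n!type\tInt\tStr\n\t44\talice\n\t 1\t\"a\\tb\"\n"
names, types, rows = parse_tsv8(sample)
```

Because an encoded J8 string can't contain a literal tab, splitting on tabs
before decoding any cell is safe.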

TODO: What about empty cells? Are they equivalent to `null`? TSV apparently
can't have empty cells, as the rule is `[character]+`, not `[character]*`.

Column attributes:

- `!format` could be Instant / Duration?

### Design Notes

TODO: This section will be filled in as we implement TSV8.

- Null Issues:
  - Are bools nullable? There seems to be no reason for it, but a value could
    still be missing.
  - Are ints nullable? In SQL they probably are.
  - Are floats nullable? Yes, like NA in R.
  - Decoders can use a parallel typed column to indicate nulls?

- It's OK to use plain TSV in YSH programs as well. You don't have to add
  types if you don't want to.

## Summary

This document described an upgrade of JSON strings:

- J8 Strings (in 3 styles)

And data formats built on top of these strings:

- JSON8 - tree-shaped records
- J8 Lines - Unix streams
- TSV8 - table-shaped data

## Appendix

### Related Links

- <https://json.org/>
- JSON extensions
  - <https://json5.org/>
  - [JSON with Commas and
    Comments](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
  - Survey: <https://github.com/json-next/awesome-json-next>

### Future Work

We could have an SEXP8 format for:

- Concrete syntax trees, with location information
- Textual IRs like WebAssembly

## FAQ

### Why are byte escapes spelled `\yff`, and not `\xff` as in C?

Because in JavaScript and Python, `\xff` is a **code point**, not a byte. That
is, it's a synonym for `\u00ff`, which is encoded in UTF-8 as the 2 bytes
`0xc3 0xbf`.

This is **exactly** the confusion we want to avoid, so `\yff` is explicitly
different.

One of Chrome's JSON encoders [also has this
confusion](https://source.chromium.org/chromium/chromium/src/+/main:base/json/json_reader.h;l=27;drc=d0919138b7951c1a154cf802a68aad7904b6f4c9).
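
The distinction is easy to see in Python, where `\xff` inside a `str` literal
is a code point and `\xff` inside a `bytes` literal is a single byte (the
latter is what J8 spells `\yff`):

```python
code_point = "\xff"                  # in a str, \xff is the code point U+00FF
assert code_point == "\u00ff"

utf8 = code_point.encode("utf-8")
assert utf8 == b"\xc3\xbf"           # two bytes once encoded as UTF-8

raw = b"\xff"                        # a single byte, like J8's \yff
assert len(raw) == 1 and raw != utf8
```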

### Why have both `u''` and `b''` strings, if only `b''` is technically needed?

A few reasons:

1. Apps in languages like Python and Rust could make use of the distinction.
   Oils doesn't have a string/bytes distinction (on the "interior"), but many
   languages do.
1. Using `u''` strings can avoid hacks like
   [WTF-8](http://simonsapin.github.io/wtf-8/), which is often required for
   round-tripping arbitrary JSON messages. Our `u''` strings don't require
   WTF-8 because they can't represent surrogate halves.
1. `u''` strings add trivial weight to the spec, since compared to `b''`
   strings, they simply remove `\yff`. This is true because *encoded* J8 strings
   must be valid UTF-8.

### Why not use double quotes like `u""` and `b""`?

J8-style strings could have used double quotes. But single quotes make the new
styles more visually distinct from `""`, and it allows `''` as a synonym for
`u''`.

Compared to `""` strings, `''` strings don't have a UTF-16 legacy.

### How do I write a J8 encoder and decoder?

The list of errors at [ref/chap-errors.html](ref/chap-errors.html) may be a
good starting point.

TODO: describe the Oils implementation.

## Glossary

- **J8 Strings** - the building block for JSON8 and TSV8. There are 3 similar
  syntaxes: `"foo"` and `b'foo'` and `u'foo'`.
- **JSON strings** - double quoted strings `"foo"`.
- **J8-style strings** - either `b'foo'` or `u'foo'`.

Formats built on J8 strings:

- **J8 Lines** - unquoted and J8 strings, one per line.
- **JSON8** - An upgrade of JSON.
- **TSV8** - An upgrade of TSV.