| 1 | ---
|
| 2 | default_highlighter: oils-sh
|
| 3 | ---
|
| 4 |
|
| 5 | J8 Notation - Fixing the JSON-Unix Mismatch
|
| 6 | ===========
|
| 7 |
|
| 8 | J8 Notation is a set of text interchange formats. It's a syntax for:
|
| 9 |
|
| 10 | 1. **strings** / bytes
|
| 11 | 1. tree-shaped **records** (like [JSON]($xref))
|
| 12 | 1. line-based **streams** (like Unix)
|
| 13 | 1. **tables** (like TSV)
|
| 14 |
|
| 15 | It's part of the Oils project, and is intended to solve the *JSON-Unix
|
| 16 | Mismatch*: the Unix kernel deals with bytes, while JSON deals with Unicode
|
| 17 | strings (plus UTF-16 errors).
|
| 18 |
|
| 19 | It's backward compatible with [JSON]($xref), and built on top of
|
| 20 | it.
|
| 21 |
|
| 22 | But just like JSON isn't only for JavaScript, J8 Notation isn't only for Oils.
|
| 23 | Any language that understands JSON should also understand J8 Notation.
|
| 24 |
|
| 25 | (Note: J8 replaced the similar [QSN](qsn.html) design in January
|
| 26 | 2024. QSN wasn't as compatible with both JSON and YSH code.)
|
| 27 |
|
| 28 | <div id="toc">
|
| 29 | </div>
|
| 30 |
|
| 31 | ## Quick Picture
|
| 32 |
|
| 33 | <style>
|
| 34 | .uni4 {
|
| 35 | /* color: #111; */
|
| 36 | }
|
| 37 | .dq {
|
| 38 | color: darkred;
|
| 39 | }
|
| 40 | .sq {
|
| 41 | color: #111;
|
| 42 | }
|
| 43 | </style>
|
| 44 |
|
| 45 | There are 3 styles of J8 strings:
|
| 46 |
|
| 47 | <pre style="font-size: x-large;">
|
| 48 | <span class=dq>"</span>hi 🙂 \u<span class=uni4>D83D</span>\u<span class=uni4>DE42</span><span class=dq>"</span> <span class="sh-comment"># JSON-style, with surrogate pair</span>
|
| 49 |
|
| 50 | <span class=sq>b'</span>hi 🙂 \yF0\y9F\y99\y82<span class=sq>'</span> <span class="sh-comment"># Can be ANY bytes, including UTF-8</span>
|
| 51 |
|
| 52 | <span class=sq>u'</span>hi 🙂 \u{1F642}<span class=sq>'</span> <span class="sh-comment"># nice alternative syntax</span>
|
| 53 | </pre>
|
| 54 |
|
| 55 | They all denote the same decoded string — "hi" and two `U+1F642` smiley
|
| 56 | faces:
|
| 57 |
|
| 58 | <pre style="font-size: x-large;">
|
| 59 | hi 🙂 🙂
|
| 60 | </pre>
|
| 61 |
|
| 62 | Why did we add these `u''` and `b''` strings?
|
| 63 |
|
| 64 | - We want to represent any string that a Unix kernel can emit (`argv` arrays,
|
| 65 | env variables, filenames, file contents, etc.)
|
| 66 | - J8 encoders emit `b''` strings to avoid losing information.
|
| 67 | - `u''` strings are like `b''` strings, but they can only express valid
|
| 68 | Unicode strings.
|
| 69 |
|
| 70 | <!-- They can't express arbitrary binary data, and there's no such thing as a
|
| 71 | surrogate pair or half. -->
|
| 72 |
|
| 73 | ---
|
| 74 |
|
| 75 | Now, starting with J8 strings, we define three formats. First, JSON8:
|
| 76 |
|
| 77 | { name: "Alice",
|
| 78 | signature: b'\y01 ... \yff', # binary data
|
| 79 | }
|
| 80 |
|
| 81 | J8 Lines:
|
| 82 |
|
| 83 | doc/hello.md
|
| 84 | "doc/with spaces.md"
|
| 85 | b'doc/with byte \yff.md'
|
| 86 |
|
| 87 | and TSV8:
|
| 88 |
|
| 89 | !tsv8 size name
|
| 90 | !type Int Str
|
| 91 | 42 doc/hello.md
|
| 92 | 55 "doc/with spaces.md"
|
| 93 | 99 b'doc/with byte \yff.md'
|
| 94 |
|
| 95 | Together, these are called *J8 Notation*.
|
| 96 |
|
| 97 | (JSON8 and TSV8 are still to be fully implemented in Oils.)
|
| 98 |
|
| 99 | ## Goals
|
| 100 |
|
| 101 | 1. Fix the **JSON-Unix mismatch**: all text formats should be able to express
|
| 102 | byte strings.
|
| 103 | - But it's OK to use plain JSON in Oils, e.g. when filenames are known to be
|
| 104 | valid Unicode strings.
|
| 105 | 1. Provide an option to avoid the surrogate pair / **UTF-16 legacy** of JSON.
|
| 106 | 1. Allow expressing metadata about **strings vs. bytes**.
|
| 107 | 1. Turn TSV into an **exterior** [data
|
| 108 | frame](https://www.oilshell.org/blog/2018/11/30.html) format.
|
| 109 | - Unix tools like `awk`, `cut`, and `sort` already understand tables
|
| 110 | informally.
|
| 111 |
|
| 112 | <!--
|
| 113 | - TSV8 cells can represent arbitrary binary data, including tabs and
|
| 114 | newlines.
|
| 115 | -->
|
| 116 |
|
| 117 | Non-goals:
|
| 118 |
|
| 119 | 1. "Replace" JSON. JSON8 is backward compatible with JSON, and sometimes the
|
| 120 | lossy encoding is OK.
|
| 121 | 1. Resolve the strings vs. bytes dilemma in all situations.
|
| 122 | - Like JSON, our spec is **syntactic**. We don't specify a mapping from J8
|
| 123 | strings to interior data types in any particular language.
|
| 124 |
|
| 125 | <!--
|
| 126 | ## J8 Notation in As Few Words As Possible
|
| 127 |
|
| 128 | J8 Strings are a superset of JSON strings:
|
| 129 |
|
| 130 | Only valid unicode:
|
| 131 |
|
| 132 | <pre style="font-size: x-large;">
|
| 133 | u'hi 🤦 \u{1f926}' → hi 🤦 🤦
|
| 134 | </pre>
|
| 135 |
|
| 136 | JSON: unicode + surrogate halves:
|
| 137 |
|
| 138 | <pre style="font-size: x-large;">
|
| 139 | "hi 🤦 \ud83e\udd26" → hi 🤦 🤦
|
| 140 | "\ud83e"
|
| 141 | </pre>
|
| 142 |
|
| 143 | Any byte string:
|
| 144 |
|
| 145 | <pre style="font-size: x-large;">
|
| 146 | b'hi 🤦 \u{1f926} \yf0\y9f\ya4\ya6' → hi 🤦 🤦 🤦
|
| 147 | b'\yff'
|
| 148 | </pre>
|
| 149 |
|
| 150 | ## Structured Formats
|
| 151 |
|
| 152 | ### JSON8
|
| 153 |
|
| 154 | ### TSV8
|
| 155 |
|
| 156 | 1. Required first row with column names
|
| 157 | 1. Optional second row with column types
|
| 158 | 1. Gutter Column
|
| 159 |
|
| 160 | -->
|
| 161 |
|
| 162 | ## Reference
|
| 163 |
|
| 164 | See the [Data Notation Table of Contents](ref/toc-data.html) in the [Oils
|
| 165 | Reference](ref/index.html).
|
| 166 |
|
| 167 | ### TODO / Diagrams
|
| 168 |
|
| 169 | - Diagram of Evolution
|
| 170 | - JSON strings → J8 Strings
|
| 171 | - J8 strings as a building block → JSON8 and TSV8
|
| 172 | - Venn Diagrams of Data Language Relationships
|
| 173 | - If you add the left "gutter" column, every TSV is valid TSV8.
|
| 174 | - Every TSV8 is also syntactically valid TSV. For example, you can import it
|
| 175 | into a spreadsheet, and remove/ignore the gutter column and type row.
|
| 176 | - TODO: make a screenshot and test it
|
| 177 | - Doc: How to turn a JSON library into a J8 Notation library.
|
| 178 | - Issue: an interior type that can represent byte strings.
|
| 179 |
|
| 180 | ## J8 Strings - Unicode and bytes
|
| 181 |
|
| 182 | Let's review JSON strings, and then describe J8 strings.
|
| 183 |
|
| 184 | ### Review of JSON strings
|
| 185 |
|
| 186 | JSON strings are enclosed in double quotes, and may have these escape
|
| 187 | sequences:
|
| 188 |
|
| 189 | \" \\ \/
|
| 190 | \b \f \n \r \t
|
| 191 | \u1234
|
| 192 |
|
| 193 | Properties of JSON:
|
| 194 |
|
| 195 | - The encoded form must also be valid UTF-8.
|
| 196 | - The encoded form can't contain literal control characters, including literal
|
| 197 | tabs or newlines. (This is good for TSV8, because it means a literal tab is
|
| 198 | always a field separator.)
|
| 199 |
|
| 200 | ### J8 Description
|
| 201 |
|
| 202 | There are 3 **styles** of J8 strings:
|
| 203 |
|
| 204 | 1. JSON strings `j""`, which may be written `""`
|
| 205 | 1. `b''` strings
|
| 206 | 1. `u''` strings
|
| 207 |
|
| 208 | `b''` strings have these escapes:
|
| 209 |
|
| 210 | \yff # byte escape
|
| 211 | \u{1f926} # code point escape. UTF-16 escapes like \u1234
|
| 212 | # are ILLEGAL
|
| 213 | \' # single quote, in addition to \"
|
| 214 | \" \\ \/ # same as JSON
|
| 215 | \b \f \n \r \t
|
| 216 |
|
| 217 | (JSON-style double-quoted strings do not add the `\'` escape. Except for the optional
|
| 218 | `j` prefix, they remain the same.)
|
| 219 |
|
| 220 | Examples:
|
| 221 |
|
| 222 | b''
|
| 223 | b'hello'
|
| 224 | b'\\'
|
| 225 | b'"double" \'single\''
|
| 226 | b'nul byte \y00, unicode \u{1f642}'
|
| 227 |
|
| 228 | `u''` strings have all the same escapes, but **not** `\yff`. This implies that
|
| 229 | they're always valid unicode strings. (If JSON-style `\u1234` escapes were
|
| 230 | allowed, they wouldn't be.)
|
| 231 |
|
| 232 | Examples:
|
| 233 |
|
| 234 | u''
|
| 235 | u'hello'
|
| 236 | u'unicode string \u{1f642}'
|
| 237 |
|
| 238 | A string *without* a prefix, like `'foo'`, is equivalent to `u'foo'`:
|
| 239 |
|
| 240 | 'this is a u string' # discouraged, unless the context is clear
|
| 241 |
|
| 242 | u'this is a u string' # better to be explicit
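The `b''` escape rules above can be sketched as a small decoder. This is a minimal illustration in Python, not the Oils implementation: the names `decode_b_string`, `SIMPLE`, and `TOKEN` are hypothetical, and the rejection of literal control characters is borrowed from the JSON rules reviewed earlier. A `u''` decoder would presumably be the same code with the `\yff` branch removed.

```python
import re

# Simple escapes shared with JSON, plus \' (hypothetical helper table).
SIMPLE = {
    '"': b'"', "'": b"'", "\\": b"\\", "/": b"/",
    "b": b"\b", "f": b"\f", "n": b"\n", "r": b"\r", "t": b"\t",
}

# One token: \yHH, \u{...}, a one-character escape, or a run of literal text.
# Literal text excludes backslashes, single quotes, and control characters.
TOKEN = re.compile(
    r"\\y([0-9a-fA-F]{2})"
    r"|\\u\{([0-9a-fA-F]{1,6})\}"
    r"|\\(.)"
    r"|([^\\'\x00-\x1f]+)",
    re.DOTALL,
)

def decode_b_string(encoded: str) -> bytes:
    """Decode text of the form b'...' into the bytes it denotes (sketch)."""
    if len(encoded) < 3 or not encoded.startswith("b'") or not encoded.endswith("'"):
        raise ValueError("expected b'...'")
    body = encoded[2:-1]
    out = bytearray()
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        byte_hex, code_hex, simple, literal = m.groups()
        if byte_hex is not None:          # \yff: exactly one byte
            out.append(int(byte_hex, 16))
        elif code_hex is not None:        # \u{1f642}: one code point as UTF-8
            code = int(code_hex, 16)
            if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            out += chr(code).encode("utf-8")
        elif simple is not None:          # \n, \', \\ ... (\u1234 lands here and fails)
            if simple not in SIMPLE:
                raise ValueError(f"invalid escape \\{simple}")
            out += SIMPLE[simple]
        else:                             # literal text is copied as UTF-8
            out += literal.encode("utf-8")
        pos = m.end()
    return bytes(out)
```

For instance, `decode_b_string(r"b'\yff'")` yields the single byte `0xff`, while a JSON-style escape such as `\u1234` is rejected.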
|
| 243 |
|
| 244 | ### What's representable by each style?
|
| 245 |
|
| 246 | <style>
|
| 247 | #subset {
|
| 248 | text-align: center;
|
| 249 | background-color: #DEE;
|
| 250 | padding-top: 0.5em; padding-bottom: 0.5em;
|
| 251 | margin-left: 3em; margin-right: 3em;
|
| 252 | }
|
| 253 | .set {
|
| 254 | font-size: x-large;
|
| 255 | }
|
| 256 | </style>
|
| 257 |
|
| 258 | These relationships might help you understand the 3 styles of strings:
|
| 259 |
|
| 260 | <div id="subset">
|
| 261 |
|
| 262 | <span class="set">Strings representable by `u''`</span><br/>
|
| 263 | = All Unicode Strings (no more and no less)
|
| 264 |
|
| 265 | <b>⊂</b>
|
| 266 |
|
| 267 | <span class="set">Strings representable by `""`</span> (JSON-style)<br/>
|
| 268 | = All Unicode Strings <b>∪</b> Surrogate Half Errors
|
| 269 |
|
| 270 | <b>⊂</b>
|
| 271 |
|
| 272 | <span class="set">Strings representable by `b''`</span><br/>
|
| 273 | = All Byte Strings
|
| 274 |
|
| 275 | </div>
|
| 276 |
|
| 277 | Examples:
|
| 278 |
|
| 279 | - The JSON message `"\udd26"` represents a string that's not Unicode — it
|
| 280 | has a surrogate half error. This string is **not** representable with `u''`
|
| 281 | strings.
|
| 282 | - The J8 message `b'\yff'` represents a byte string. This string is **not**
|
| 283 | representable with JSON strings or `u''` strings.
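Both examples can be checked with a few lines of Python. This is only an illustration using the standard `json` module, which happens to accept lone surrogate escapes; the variable names are arbitrary.

```python
import json

# A JSON string holding a lone surrogate half decodes to a Python str that
# cannot be encoded as UTF-8, so it is not a valid Unicode string.
half = json.loads('"\\udd26"')
try:
    half.encode("utf-8")
    half_is_unicode = True
except UnicodeEncodeError:
    half_is_unicode = False

# The byte 0xff never occurs in valid UTF-8, so this byte string has no
# Unicode (and therefore no JSON or u'') representation.
try:
    b"\xff".decode("utf-8")
    ff_is_utf8 = True
except UnicodeDecodeError:
    ff_is_utf8 = False

# A complete surrogate pair, by contrast, is an ordinary code point.
pair = json.loads('"\\ud83d\\ude42"')
```

Here `half_is_unicode` and `ff_is_utf8` both end up `False`, and `pair` is the single code point `U+1F642`.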
|
| 284 |
|
| 285 | ### Asymmetry of Encoders and Decoders
|
| 286 |
|
| 287 | A few things to notice about J8 **encoders**:
|
| 288 |
|
| 289 | 1. They can emit only `""` strings, possibly using the Unicode replacement char
|
| 290 | `U+FFFD`. This is a strict JSON encoder.
|
| 291 | 1. They *must* emit `b''` strings to preserve all information, because `U+FFFD`
|
| 292 | replacement is lossy.
|
| 293 | 1. They *never* need to emit `u''` strings.
|
| 294 | - This is because `""` strings (and `b''` strings) can represent all values
|
| 295 | that `u''` strings can. Still, `u''` strings may be desirable in some
|
| 296 | situations, like when you want `\u{1f642}` escapes, or to assert that a
|
| 297 | value must be a valid Unicode string.
|
| 298 |
|
| 299 | On the other hand, J8 **decoders** must accept all 3 kinds of strings.
|
| 300 |
|
| 301 | ### YSH has 2 of the 3 styles
|
| 302 |
|
| 303 | A nice property of YSH is that the `u''` and `b''` strings are valid code:
|
| 304 |
|
| 305 | echo u'hi \u{1f642}' # u respected in YSH, but not OSH
|
| 306 |
|
| 307 | var myBytes = b'\yff\yfe'
|
| 308 |
|
| 309 | This is useful for correct code generation, and simplifies the language.
|
| 310 |
|
| 311 | But JSON-style strings aren't valid in YSH. The two usages of double quotes
|
| 312 | can't really be reconciled, because JSON looks like `"line\n"` and shell looks
|
| 313 | like `"x = ${myvar}"`.
|
| 314 |
|
| 315 | ### J8 Strings vs. POSIX Shell Strings
|
| 316 |
|
| 317 | When the encoded form of a J8 string doesn't contain a **backslash**, it's
|
| 318 | identical to a POSIX shell string.
|
| 319 |
|
| 320 | In this case, it can make sense to omit the `u''` prefix. Example:
|
| 321 |
|
| 322 | <pre>
|
| 323 | shell_string='hi 🙂'
|
| 324 |
|
| 325 | var ysh_str = u'hi 🙂'
|
| 326 |
|
| 327 | var ysh_str = 'hi 🙂' <span class="sh-comment"># same thing</span>
|
| 328 | </pre>
|
| 329 |
|
| 330 | An encoded J8 string has no backslashes when the original string has all these
|
| 331 | properties:
|
| 332 |
|
| 333 | 1. Valid Unicode (no non-UTF-8 bytes).
|
| 334 | 1. No ASCII control characters. All bytes are `0x20` and greater.
|
| 335 | 1. No backslashes or single quotes. (All other required escapes are control
|
| 336 | characters.)
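The three properties translate directly into a check. A minimal Python sketch follows; the function name `needs_no_backslash` is made up for illustration.

```python
def needs_no_backslash(data: bytes) -> bool:
    """True if the J8 encoding of `data` would contain no backslashes."""
    try:
        text = data.decode("utf-8")        # 1. valid Unicode
    except UnicodeDecodeError:
        return False
    return all(
        ch >= " "                          # 2. no ASCII control characters
        and ch not in "\\'"                # 3. no backslashes or single quotes
        for ch in text
    )
```

A string that passes this check can be written between single quotes and read the same way by a POSIX shell and by a J8 decoder.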
|
| 337 |
|
| 338 |
|
| 339 | ## JSON8 - Tree-Shaped Records
|
| 340 |
|
| 341 | Now that we've defined J8 strings, we can define JSON8, an obvious extension of
|
| 342 | JSON.
|
| 343 |
|
| 344 | (Not implemented yet.)
|
| 345 |
|
| 346 | ### Review of JSON
|
| 347 |
|
| 348 | See <https://json.org>
|
| 349 |
|
| 350 | [primitive] null true false
|
| 351 | [number] 42 -1.2e-4
|
| 352 | [string] "hello\n"
|
| 353 | [array] [1, 2, 3]
|
| 354 | [object] {"key": 42}
|
| 355 |
|
| 356 | ### JSON8 Description
|
| 357 |
|
| 358 | JSON8 is like JSON, but:
|
| 359 |
|
| 360 | 1. All strings can be J8 strings, in any of the **3 styles** described
|
| 361 | above.
|
| 362 | 1. Object/Dict keys may be **unquoted**, like `{age: 42}`
|
| 363 | - Unquoted keys must be a valid JS identifier name matching the pattern
|
| 364 | `[a-zA-Z_][a-zA-Z0-9_]*`.
|
| 365 | 1. **Trailing commas** are allowed on objects and arrays: `{"d": 42,}` and `[42,]`
|
| 366 | 1. End-of-line comments. We use `#` to be consistent with shell.
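As a small illustration of the unquoted-key rule, an encoder could decide per key whether quoting is needed. This Python sketch uses the pattern given above; `format_key` is a hypothetical helper, and it only handles keys that are already valid Unicode strings.

```python
import json
import re

# The identifier pattern from the rule above.
UNQUOTED_KEY = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def format_key(key: str) -> str:
    """Return the key unquoted when allowed, else as a JSON-style string."""
    if UNQUOTED_KEY.fullmatch(key):
        return key
    return json.dumps(key, ensure_ascii=False)
```

Using `fullmatch` matters here: a key like `first name` starts with a valid identifier but must still be quoted.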
|
| 367 |
|
| 368 | <!--
|
| 369 | Note that // is consistent with JavaScript / JSON5, but it actually conflicts
|
| 370 | with Scheme symbols, which we use for NIL8. These are both valid Scheme, and
|
| 371 | probably NIL8:
|
| 372 |
|
| 373 | (/ 5 3)
|
| 374 | (// 5 3) # This should not start a comment!
|
| 375 | -->
|
| 376 |
|
| 377 | Example:
|
| 378 |
|
| 379 | ```
|
| 380 | { name: "Bob", # comment
|
| 381 | age: 30,
|
| 382 | sig: b'\y00\y01 ... \yff', # trailing comma, binary data
|
| 383 | }
|
| 384 | ```
|
| 385 |
|
| 386 | <!--
|
| 387 | !json8 # optional prefix to distinguish from JSON
|
| 388 |
|
| 389 | I think using unquoted keys is a good enough signal, or MIME type.
|
| 390 |
|
| 391 | -->
|
| 392 |
|
| 393 | ## J8 Lines - Lines of Text
|
| 394 |
|
| 395 | *J8 Lines* is another format built on J8 strings. Each line is either:
|
| 396 |
|
| 397 | 1. An unquoted string, which must be valid UTF-8. Whitespace is allowed, but
|
| 398 | not other ASCII control chars.
|
| 399 | 2. A quoted J8 string (JSON style `""` or J8-style `b'' u''`)
|
| 400 | 3. An **ignored** empty line
|
| 401 |
|
| 402 | In all cases, leading and trailing whitespace is ignored.
|
| 403 |
|
| 404 | ---
|
| 405 |
|
| 406 | For example, 6 strings with weird characters could be represented like this:
|
| 407 |
|
| 408 | dir/with spaces.txt # unquoted string must be UTF-8
|
| 409 | "dir/with newline \n.txt" # JSON-style
|
| 410 | b'dir/with bytes \yff.txt' # J8-style
|
| 411 | u'dir/unicode \u{3bc}'
|
| 412 | # ignored empty line
|
| 413 | '' # empty string, not ignored
|
| 414 | 'dir/unicode \u{3bc}' # no prefix implies u''
|
| 415 |
|
| 416 | Note that J8 strings always occupy **one** physical line, because they can't
|
| 417 | contain unescaped control characters, including newlines.
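A reader for this format can start by classifying each line before decoding it. The Python sketch below makes several assumptions: `classify_j8_lines` is a hypothetical name, "whitespace" is taken to mean spaces and tabs, and the `#` annotations in the example above are treated as explanation rather than as part of the format.

```python
def classify_j8_lines(text: str):
    """Split text into (kind, content) pairs, skipping empty lines."""
    result = []
    for raw in text.split("\n"):
        line = raw.strip(" \t")            # leading/trailing whitespace is ignored
        if not line:
            continue                       # ignored empty line
        if line[0] in "\"'" or line[:2] in ("b'", "u'", 'j"'):
            result.append(("quoted", line))    # still needs J8 string decoding
        else:
            result.append(("unquoted", line))  # must already be valid UTF-8
    return result
```

Note that `''` is classified as a quoted (empty) string, so it survives, while a blank line does not.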
|
| 418 |
|
| 419 | *J8 Lines* can be viewed as a simpler case of TSV8, described in the next
|
| 420 | section.
|
| 421 |
|
| 422 | <!--
|
| 423 |
|
| 424 | TODO: show grammar, which disallows anything but significant tabs/newlines, and
|
| 425 | insignificant spaces)
|
| 426 | -->
|
| 427 |
|
| 428 | #### Related
|
| 429 |
|
| 430 | - <https://jsonlines.org/> allows not just strings, but any value like `{}` and
|
| 431 | `[]`. We could define an obvious "JSON8 Lines" format, which is different
|
| 432 | than "J8 Lines".
|
| 433 |
|
| 434 | ## TSV8 - Table-Shaped Text
|
| 435 |
|
| 436 | Let's review TSV, and then describe TSV8.
|
| 437 |
|
| 438 | ### Review of TSV
|
| 439 |
|
| 440 | TSV has a very short specification:
|
| 441 |
|
| 442 | - <https://www.iana.org/assignments/media-types/text/tab-separated-values>
|
| 443 |
|
| 444 | Example:
|
| 445 |
|
| 446 | ```
|
| 447 | name<TAB>age
|
| 448 | alice<TAB>44
|
| 449 | bob<TAB>33
|
| 450 | ```
|
| 451 |
|
| 452 | Limitations:
|
| 453 |
|
| 454 | - Fields can't contain tabs or newlines.
|
| 455 | - There's no escaping, so unprintable bytes in field values result in an
|
| 456 | unprintable TSV file.
|
| 457 | - Spaces are easy to confuse with tabs.
|
| 458 |
|
| 459 | ### TSV8 Description
|
| 460 |
|
| 461 | TSV8 is like TSV with:
|
| 462 |
|
| 463 | 1. A `!tsv8` prefix and required column names.
|
| 464 | 2. An optional `!type` line, with types `Bool Int Float Str`.
|
| 465 | 3. Other optional column attributes.
|
| 466 | 4. Rows of data, each starting with an empty "gutter" column.
|
| 467 |
|
| 468 | Example:
|
| 469 |
|
| 470 | ```
|
| 471 | !tsv8 age name
|
| 472 | !type Int Str # optional types
|
| 473 | !other x y # more column metadata
|
| 474 | 44 alice
|
| 475 | 33 bob
|
| 476 | 1 "a\tb"
|
| 477 | 2 b'nul \y00'
|
| 478 | 3 u'unicode \u{3bc}'
|
| 479 | ```
|
| 480 |
|
| 481 | Types:
|
| 482 |
|
| 483 | ```
|
| 484 | [Bool] false true
|
| 485 | [Int] JSON numbers, restricted to [0-9]+
|
| 486 | [Float] same as JSON
|
| 487 | [Str] J8 string (any of the 3 styles)
|
| 488 | ```
|
| 489 |
|
| 490 | Rules for cells:
|
| 491 |
|
| 492 | 1. They can be any of 4 forms in J8 Lines:
|
| 493 | 1. Unquoted
|
| 494 | 1. JSON-style `""`
|
| 495 | 1. `u''`
|
| 496 | 1. `b''`
|
| 497 | 1. Leading and trailing whitespace must be stripped, as in J8 Lines.
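Since TSV8 is not fully implemented yet, the following Python sketch is only one reading of the description above. `parse_tsv8` is a hypothetical function; it assumes blank lines are skipped, that cell padding consists of spaces, and it leaves cells as undecoded text (a real reader would then decode each cell as a J8 string or a typed value).

```python
def parse_tsv8(text: str):
    """Return (column_names, attributes, rows) from TSV8 text (sketch)."""
    names = None
    attrs = {}
    rows = []
    for line in text.split("\n"):
        if not line.strip(" \t"):
            continue                       # assumption: blank lines are skipped
        # The first tab-separated field is the "gutter" column.
        gutter, *cells = [c.strip(" ") for c in line.split("\t")]
        if names is None:
            if gutter != "!tsv8":
                raise ValueError("first row must start with !tsv8")
            names = cells                  # required column names
        elif gutter.startswith("!"):
            attrs[gutter[1:]] = cells      # e.g. the optional !type row
        elif gutter == "":
            rows.append(cells)             # data row with an empty gutter
        else:
            raise ValueError("data rows must start with an empty gutter column")
    if names is None:
        raise ValueError("missing !tsv8 header")
    return names, attrs, rows
```

Because quoted cells escape tabs as `\t`, splitting on literal tabs is safe: a literal tab is always a field separator.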
|
| 498 |
|
| 499 | TODO: What about empty cells? Are they equivalent to `null`? TSV apparently
|
| 500 | can't have empty cells, as the rule is `[character]+`, not `[character]*`.
|
| 501 |
|
| 502 | Column attributes:
|
| 503 |
|
| 504 | - `!format` could be Instant / Duration?
|
| 505 |
|
| 506 | ### Design Notes
|
| 507 |
|
| 508 | TODO: This section will be filled in as we implement TSV8.
|
| 509 |
|
| 510 | - Null Issues:
|
| 511 | - Are bools nullable? Seems like no reason, but you could be missing
|
| 512 | - Are ints nullable? In SQL they probably are
|
| 513 | - Are floats nullable? Yes, like NA in R.
|
| 514 | - Decoders can use a parallel typed column to indicate nulls?
|
| 515 |
|
| 516 | - It's OK to use plain TSV in YSH programs as well. You don't have to add
|
| 517 | types if you don't want to.
|
| 518 |
|
| 519 |
|
| 520 | ## Summary
|
| 521 |
|
| 522 | This document described an upgrade of JSON strings:
|
| 523 |
|
| 524 | - J8 Strings (in 3 styles)
|
| 525 |
|
| 526 | And data formats built on top of these strings:
|
| 527 |
|
| 528 | - JSON8 - tree-shaped records
|
| 529 | - J8 Lines - Unix streams
|
| 530 | - TSV8 - table-shaped data
|
| 531 |
|
| 532 | ## Appendix
|
| 533 |
|
| 534 | ### Related Links
|
| 535 |
|
| 536 | - <https://json.org/>
|
| 537 | - JSON extensions
|
| 538 | - <https://json5.org/>
|
| 539 | - [JSON with Commas and
|
| 540 | Comments](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
|
| 541 | - Survey: <https://github.com/json-next/awesome-json-next>
|
| 542 |
|
| 543 | ### Future Work
|
| 544 |
|
| 545 | We could have an SEXP8 format for:
|
| 546 |
|
| 547 | - Concrete syntax trees, with location information
|
| 548 | - Textual IRs like WebAssembly
|
| 549 |
|
| 550 | ## FAQ
|
| 551 |
|
| 552 | ### Why are byte escapes spelled `\yff`, and not `\xff` as in C?
|
| 553 |
|
| 554 | Because in JavaScript and Python, `\xff` is a **code point**, not a byte. That
|
| 555 | is, it's a synonym for `\u00ff`, which is encoded in UTF-8 as the 2 bytes `0xc3
|
| 556 | 0xbf`.
|
| 557 |
|
| 558 | This is **exactly** the confusion we want to avoid, so `\yff` is explicitly
|
| 559 | different.
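The difference is easy to see in Python, where `\xff` means different things in a text literal and in a bytes literal (variable names here are arbitrary):

```python
code_point = "\xff"                  # text literal: the code point U+00FF
same_thing = "\u00ff"                # the synonym mentioned above
utf8 = code_point.encode("utf-8")    # two bytes: 0xc3 0xbf

single_byte = b"\xff"                # bytes literal: the one byte 0xff
```

So `"\xff"` and `b"\xff"` look alike but denote different data once the text is encoded as UTF-8. A J8 `\yff` escape always means the single byte.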
|
| 560 |
|
| 561 | One of Chrome's JSON encoders [also has this
|
| 562 | confusion](https://source.chromium.org/chromium/chromium/src/+/main:base/json/json_reader.h;l=27;drc=d0919138b7951c1a154cf802a68aad7904b6f4c9).
|
| 563 |
|
| 564 | ### Why have both `u''` and `b''` strings, if only `b''` is technically needed?
|
| 565 |
|
| 566 | A few reasons:
|
| 567 |
|
| 568 | 1. Apps in languages like Python and Rust could make use of the distinction.
|
| 569 | Oils doesn't have a string/bytes distinction (on the "interior"), but many
|
| 570 | languages do.
|
| 571 | 1. Using `u''` strings can avoid hacks like
|
| 572 | [WTF-8](http://simonsapin.github.io/wtf-8/), which is often required for
|
| 573 | round-tripping arbitrary JSON messages. Our `u''` strings don't require
|
| 574 | WTF-8 because they can't represent surrogate halves.
|
| 575 | 1. `u''` strings add trivial weight to the spec, since compared to `b''`
|
| 576 | strings, they simply remove `\yff`. This is true because *encoded* J8 strings
|
| 577 | must be valid UTF-8.
|
| 578 |
|
| 579 | ### Why not use double quotes like `u""` and `b""`?
|
| 580 |
|
| 581 | J8-style strings could have used double quotes. But single quotes make the new
|
| 582 | styles more visually distinct from `""`, and it allows `''` as a synonym for
|
| 583 | `u''`.
|
| 584 |
|
| 585 | Compared to `""` strings, `''` strings don't have a UTF-16 legacy.
|
| 586 |
|
| 587 | ### How do I write a J8 encoder and decoder?
|
| 588 |
|
| 589 | The list of errors at [ref/chap-errors.html](ref/chap-errors.html) may be a
|
| 590 | good starting point.
|
| 591 |
|
| 592 | TODO: describe the Oils implementation.
|
| 593 |
|
| 594 | ### Should a J8 number be mapped to an Int, Float, or Decimal type?
|
| 595 |
|
| 596 | J8 Notation is like JSON: it only specifies the syntax of messages on the wire.
|
| 597 |
|
| 598 | The mapping of text to types is left to implementers, and depends on the
|
| 599 | programming language:
|
| 600 |
|
| 601 | - Languages like C, C++, and Rust have different sizes of ints and floats
|
| 602 | - Languages like JavaScript favor floats
|
| 603 | - It's valid to map to a Decimal type, if the language runtime supports it
|
| 604 |
|
| 605 | OSH and YSH happen to use `Int` and `Float`, but this is logically separate
|
| 606 | from J8 Notation.
|
| 607 |
|
| 608 | ## Glossary
|
| 609 |
|
| 610 | - **J8 Strings** - the building block for JSON8 and TSV8. There are 3 similar
|
| 611 | syntaxes: `"foo"` and `b'foo'` and `u'foo'`.
|
| 612 | - **JSON strings** - double quoted strings `"foo"`.
|
| 613 | - **J8-style strings** - either `b'foo'` or `u'foo'`.
|
| 614 |
|
| 615 | Formats built on J8 strings:
|
| 616 |
|
| 617 | - **J8 Lines** - unquoted and J8 strings, one per line.
|
| 618 | - **JSON8** - An upgrade of JSON.
|
| 619 | - **TSV8** - An upgrade of TSV.
|
| 620 |
|
| 621 |
|