| 1 | Lazy, Lossless Lexing Libraries
|
| 2 | ===============================
|
| 3 |
|
| 4 | Right now we're using this Python code to process HTML in Oil docs. Logically,
|
| 5 | the doc pipeline looks like:
|
| 6 |
|
| 7 | (Hand-written Markdown with embedded HTML) <br/>
|
| 8 | → CommonMark renderer → <br/>
|
| 9 | (HTML) <br/>
|
| 10 | → Filter with lazy lexing → <br/>
|
| 11 | (HTML) <br/>
|
| 12 | → Filter with lazy lexing → <br/>
|
| 13 | (HTML) <br/>
|
| 14 | <br/>
|
| 15 | *... repeat N times ...* <br/>
|
| 16 | <br/>
|
| 17 | (Final HTML) <br/>
|
| 18 |
|
| 19 | Eventually it would be nice to expand this API design to more formats and make
|
| 20 | it available to Oil language users.
|
| 21 |
|
| 22 | <div id="toc">
|
| 23 | </div>
|
| 24 |
|
| 25 | ## Why?
|
| 26 |
|
| 27 | - To report good parse errors with location info
|
| 28 | - To use like `sed` on HTML
|
| 29 | - To minimize allocations, i.e. don't construct a DOM, and don't even construct
|
| 30 | substrings
|
| 31 | - So we don't "lose the stack", unlike callback-based parsing
|
| 32 | - We get an iterator of events/spans
|
| 33 | - A layer to build a "lossless syntax tree" on top of.
|
| 34 |
|
| 35 | ## Formats
|
| 36 |
|
| 37 | ### HTML
|
| 38 |
|
| 39 | HTML has two levels:
|
| 40 |
|
| 41 | 1. The `<>` structure, i.e. tags, the DOCTYPE declaration, comments, and processing
|
| 42 | instructions
|
| 43 | 2. The `name="value"` attributes inside start tags (and self-closing tags)
|
| 44 |
|
| 45 | ### TSV2
|
| 46 |
|
| 47 | - This format is **designed** to be read line-by-line (unlike CSV).
|
| 48 | - You can skip to any column, and optionally decode the field into an Bool,
|
| 49 | Int, Float, or Str.
|
| 50 |
|
| 51 | ### JSON
|
| 52 |
|
| 53 | - py-yajl is event-based, but not lazy. And it's a parser, not a lexer.
|
| 54 | - We could LEX `{}` and `[] and `""` `\` in the first step. This is lexing and
|
| 55 | not parsing.
|
| 56 |
|
| 57 |
|
| 58 | ## Links
|
| 59 |
|
| 60 | - [pulldown-cmark][]. This is called a "pull parser" in reference to the *XML
|
| 61 | Pull Parsing* API at <http://xmlpull.org>. However I would call this a
|
| 62 | *lexer* and not a *parser*.
|
| 63 | - Hm I think this indicates a need for **lossless** lexers as well?
|
| 64 | https://github.com/Byron/pulldown-cmark-to-cmark/blob/master/src/fmt.rs
|
| 65 | - It appears to be used in mdbook
|
| 66 | - [simdjson][]: This pasrer is designed for extreme speed, and the first stage
|
| 67 | of it "lazy" and uses integer indices. (We're only trying to avoid
|
| 68 | allocation; we're not avoiding branches or using SIMD! We also aim to
|
| 69 | transform the underlying data, not just parse it.)
|
| 70 |
|
| 71 | [simdjson]: https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/
|
| 72 |
|
| 73 | [pulldown-cmark]: https://github.com/raphlinus/pulldown-cmark
|
| 74 |
|
| 75 | ## Design Notes
|
| 76 |
|
| 77 | ### Lessons/Claims
|
| 78 |
|
| 79 | - You can parse HTML quickly and correctly with regexes! It has a simple
|
| 80 | syntax that's almost designed for this.
|
| 81 | - Key point: We're not parsing them **only** with regexes.
|
| 82 | - The parser is correct in the sense that its behavior on every input is
|
| 83 | fully-defined. There are no buffer overflows on edge cases -- that's the
|
| 84 | magic of regexes and the corresponding state machines. However it doesn't
|
| 85 | recognize every weird document on the web. It recognizes something close
|
| 86 | to "well-formed XML" (but it's not XHTML.)
|
| 87 | - Parsing with spans / integer positions is efficient, **composable**, and
|
| 88 | leads to better **syntax errors**.
|
| 89 | - Example: spec test format parse errors aren't good. Info is lost.
|
| 90 | Or ASDL parser? I guess it has some line info.
|
| 91 | - The API is easier to use than SAX because you're not losing the stack.
|
| 92 | (Previous transformations used Python's HTMLParser, which is a SAX API.)
|
| 93 | - It's more efficient than a DOM API. DOM allocates a lot and loses
|
| 94 | whitespace. (Speed matters when rebuilding the whole site. Each page has
|
| 95 | multiple stages. It makes the code cleaner to do multiple quick passes, like
|
| 96 | a compiler.)
|
| 97 | - In many instances, you can MODIFY the HTML doc correctly without
|
| 98 | deserializing something. For example, adding `<b>` tags around a line
|
| 99 | doesn't require unquoting and quoting `>` within them.
|
| 100 | - Preserving whitespace is useful because I want to use 'diff' to test
|
| 101 | correctness against the old pipeline.
|
| 102 | - Python's `pat.match(s, start_pos, end_pos)` is very useful for efficient
|
| 103 | lexing.
|
| 104 | - TODO: Convert to re2c to see how it does. Need YYLIMIT.
|
| 105 | - TODO: Issue of non-greedy matches.
|
| 106 | - TODO: Issue of unquoting and quoting (escaping).
|
| 107 | - The triple backtick extension to Markdown (part of CommonMark) is useful.
|
| 108 | - Although it should generate arbitrary `<div name="val">` instead. This
|
| 109 | allow natural plugins. You can write `<div>` in Markdown, but it's
|
| 110 | annoying to manually escape `<` to `>`, e.g. in graphviz or TeX.
|
| 111 | HTML is analogous to shell. A web site is a directory tree of text files!
|
| 112 | - It's better than the Snip `-->` syntax, which didn't play well with syntax
|
| 113 | highlighting.
|
| 114 | - Composable grammars: Is this the basis for Pulp?
|
| 115 |
|
| 116 | ### Open Problems
|
| 117 |
|
| 118 | - Python generators (`yield`) make the code more natural, but that's not
|
| 119 | possible in C or C++. (C++20 has coroutines though.)
|
| 120 | - Could write a compiler? Would be an excuse to clean up the OPy or mycpp
|
| 121 | ASTs.
|
| 122 | - Input handling in C/shell:
|
| 123 | - Unlike Python's regex engine, libc `regexec()` doesnt have an `end_pos`,
|
| 124 | requires `NUL` termination.
|
| 125 | - We're also using `re2c` this way. Can se use `YYLIMIT`?
|
| 126 | - Simple solution: copy the subrange, or temporarily mutate the buffer (would
|
| 127 | cause copy-on-write)
|
| 128 | - Is there a zero-copytehre a
|
| 129 |
|
| 130 | ### Comparisons
|
| 131 |
|
| 132 | - mdbook (in Rust, using [pulldown-cmark][]). Has Rust plugins.
|
| 133 | - pandoc. Supports many formats, not just HTML. Has plugins that take an AST
|
| 134 | in JSON. The AST represents pandoc-flavored Markdown. pandoc differs from
|
| 135 | Markdown in that it discourages HTML extensions.
|
| 136 | - ReStructuredText. Sphinx has plugins.
|
| 137 | - https://www.sphinx-doc.org/en/master/usage/extensions/index.html
|
| 138 | - XML Starlet (dormant, understands XPath and CSS Starlet)
|
| 139 | - R's Sweave?
|
| 140 | - bookdown?
|
| 141 | - Jupyter notebooks have code and data. Do they have plugins?
|
| 142 | - Are there tools in node.js?
|
| 143 | - https://stackoverflow.com/questions/6657216/why-doesnt-node-js-have-a-native-dom
|
| 144 | - jsdom?
|
| 145 |
|
| 146 | ## To Build On Top
|
| 147 |
|
| 148 | - CSS selectors
|
| 149 | - DOM
|
| 150 |
|