lazylex/README.md

OILS / lazylex / README.md View on Github | oilshell.org

150 lines, 121 significant

1	Lazy, Lossless Lexing Libraries
2	===============================
3
4	Right now we're using this Python code to process HTML in Oil docs. Logically,
5	the doc pipeline looks like:
6
7	(Hand-written Markdown with embedded HTML) <br/>
8	→ CommonMark renderer → <br/>
9	(HTML) <br/>
10	→ Filter with lazy lexing → <br/>
11	(HTML) <br/>
12	→ Filter with lazy lexing → <br/>
13	(HTML) <br/>
14	<br/>
15	... repeat N times ... <br/>
16	<br/>
17	(Final HTML) <br/>
18
19	Eventually it would be nice to expand this API design to more formats and make
20	it available to Oil language users.
21
22	<div id="toc">
23	</div>
24
25	## Why?
26
27	- To report good parse errors with location info
28	- To use like `sed` on HTML
29	- To minimize allocations, i.e. don't construct a DOM, and don't even construct
30	substrings
31	- So we don't "lose the stack", unlike callback-based parsing
32	- We get an iterator of events/spans
33	- A layer to build a "lossless syntax tree" on top of.
34
35	## Formats
36
37	### HTML
38
39	HTML has two levels:
40
41	1. The `<>` structure, i.e. tags, the DOCTYPE declaration, comments, and processing
42	instructions
43	2. The `name="value"` attributes inside start tags (and self-closing tags)
44
45	### TSV2
46
47	- This format is designed to be read line-by-line (unlike CSV).
48	- You can skip to any column, and optionally decode the field into an Bool,
49	Int, Float, or Str.
50
51	### JSON
52
53	- py-yajl is event-based, but not lazy. And it's a parser, not a lexer.
54	- We could LEX `{}` and `[] and `""` `\` in the first step. This is lexing and
55	not parsing.
56
57
58	## Links
59
60	- [pulldown-cmark][]. This is called a "pull parser" in reference to the *XML
61	Pull Parsing* API at <http://xmlpull.org>. However I would call this a
62	lexer and not a parser.
63	- Hm I think this indicates a need for lossless lexers as well?
64	https://github.com/Byron/pulldown-cmark-to-cmark/blob/master/src/fmt.rs
65	- It appears to be used in mdbook
66	- [simdjson][]: This pasrer is designed for extreme speed, and the first stage
67	of it "lazy" and uses integer indices. (We're only trying to avoid
68	allocation; we're not avoiding branches or using SIMD! We also aim to
69	transform the underlying data, not just parse it.)
70
71	[simdjson]: https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/
72
73	[pulldown-cmark]: https://github.com/raphlinus/pulldown-cmark
74
75	## Design Notes
76
77	### Lessons/Claims
78
79	- You can parse HTML quickly and correctly with regexes! It has a simple
80	syntax that's almost designed for this.
81	- Key point: We're not parsing them only with regexes.
82	- The parser is correct in the sense that its behavior on every input is
83	fully-defined. There are no buffer overflows on edge cases -- that's the
84	magic of regexes and the corresponding state machines. However it doesn't
85	recognize every weird document on the web. It recognizes something close
86	to "well-formed XML" (but it's not XHTML.)
87	- Parsing with spans / integer positions is efficient, composable, and
88	leads to better syntax errors.
89	- Example: spec test format parse errors aren't good. Info is lost.
90	Or ASDL parser? I guess it has some line info.
91	- The API is easier to use than SAX because you're not losing the stack.
92	(Previous transformations used Python's HTMLParser, which is a SAX API.)
93	- It's more efficient than a DOM API. DOM allocates a lot and loses
94	whitespace. (Speed matters when rebuilding the whole site. Each page has
95	multiple stages. It makes the code cleaner to do multiple quick passes, like
96	a compiler.)
97	- In many instances, you can MODIFY the HTML doc correctly without
98	deserializing something. For example, adding `<b>` tags around a line
99	doesn't require unquoting and quoting `>` within them.
100	- Preserving whitespace is useful because I want to use 'diff' to test
101	correctness against the old pipeline.
102	- Python's `pat.match(s, start_pos, end_pos)` is very useful for efficient
103	lexing.
104	- TODO: Convert to re2c to see how it does. Need YYLIMIT.
105	- TODO: Issue of non-greedy matches.
106	- TODO: Issue of unquoting and quoting (escaping).
107	- The triple backtick extension to Markdown (part of CommonMark) is useful.
108	- Although it should generate arbitrary `<div name="val">` instead. This
109	allow natural plugins. You can write `<div>` in Markdown, but it's
110	annoying to manually escape `<` to `>`, e.g. in graphviz or TeX.
111	HTML is analogous to shell. A web site is a directory tree of text files!
112	- It's better than the Snip `-->` syntax, which didn't play well with syntax
113	highlighting.
114	- Composable grammars: Is this the basis for Pulp?
115
116	### Open Problems
117
118	- Python generators (`yield`) make the code more natural, but that's not
119	possible in C or C++. (C++20 has coroutines though.)
120	- Could write a compiler? Would be an excuse to clean up the OPy or mycpp
121	ASTs.
122	- Input handling in C/shell:
123	- Unlike Python's regex engine, libc `regexec()` doesnt have an `end_pos`,
124	requires `NUL` termination.
125	- We're also using `re2c` this way. Can se use `YYLIMIT`?
126	- Simple solution: copy the subrange, or temporarily mutate the buffer (would
127	cause copy-on-write)
128	- Is there a zero-copytehre a
129
130	### Comparisons
131
132	- mdbook (in Rust, using [pulldown-cmark][]). Has Rust plugins.
133	- pandoc. Supports many formats, not just HTML. Has plugins that take an AST
134	in JSON. The AST represents pandoc-flavored Markdown. pandoc differs from
135	Markdown in that it discourages HTML extensions.
136	- ReStructuredText. Sphinx has plugins.
137	- https://www.sphinx-doc.org/en/master/usage/extensions/index.html
138	- XML Starlet (dormant, understands XPath and CSS Starlet)
139	- R's Sweave?
140	- bookdown?
141	- Jupyter notebooks have code and data. Do they have plugins?
142	- Are there tools in node.js?
143	- https://stackoverflow.com/questions/6657216/why-doesnt-node-js-have-a-native-dom
144	- jsdom?
145
146	## To Build On Top
147
148	- CSS selectors
149	- DOM
150