Micro Syntax
============

Lightweight, polyglot syntax analysis.

|
Motivations:

- YSH needs syntax highlighters, and this code is a GUIDE to writing one.
- The lexer should run on its own.  Generated parsers like TreeSitter
  require such a lexer.  In contrast to recursive descent, grammars can't
  specify lexer modes.
|
| 12 |
|
Our own dev tools:

- The Github source viewer is too slow.  We want to publish a fast version of
  our source code to view.
- Our docs need to link to source code.
- Github source viewing is APPROXIMATE anyway, because they don't execute your
  build; they don't have ENV.  They would have to "solve the halting problem".
- So let's be FAST and approximate, not SLOW and approximate.

- Multiple attempts at this polyglot problem:
  - github/semantic in Haskell
  - facebook/pfff -- semgrep heritage

- Aesthetics
  - I don't like noisy keyword highlighting.  Highlighting just comments and
    string literals looks surprisingly good.
  - Can use this on the blog too.
- HTML equivalent of showsh, showpy -- quickly jump to definitions
- I think I can generate better ctags than `devtools/ctags.sh`!  It's a simple
  format.
- I realized that "sloccount" is the same problem as syntax highlighting --
  you exclude comments, whitespace, and lines with only string literals.
  - sloccount is a huge Perl codebase, and we can stop depending on that.

- Could be used to spell check comments?
  - Look at the `sed` tool in the PR from Martin.

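The sloccount observation above can be sketched concretely: once the lexer
emits typed spans, a line counts as code only if it carries at least one span
that is not a comment or string literal (whitespace-only lines produce no
spans at all).  A minimal illustration -- the `(line_num, type)` tuples and
`count_sloc` name are hypothetical, not the real micro-syntax output:

```python
# Hypothetical span record: (line_num, type), where type is one of
# 'comment', 'string', 'code'.  Whitespace produces no spans.
def count_sloc(spans):
    """Count lines that have at least one non-comment, non-string span."""
    significant = set()
    for line_num, span_type in spans:
        if span_type == 'code':
            significant.add(line_num)
    return len(significant)

spans = [
    (1, 'comment'),               # line 1: comment only
    (2, 'code'), (2, 'string'),   # line 2: code containing a string literal
    (3, 'string'),                # line 3: only a string literal
]
print(count_sloc(spans))  # 1
```

So sloccount falls out of the same span stream the highlighter uses, with no
extra lexing pass.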
|
Other:

- Because re2c is fun, and I wanted to experiment with writing it directly.
- Ideas
  - Use this on your blog?
  - Embed in a text editor?  Can it be incremental?
|
## Related

Positively inspired:

- uchex static analysis paper (2016)
- ctags

(and re2c itself)

Also see my comment on: Rust is the future of JavaScript infrastructure -- you
need Rust/C++ semantics to be fast.  We're using C++ because it's already in
our codebase, but Rust is probably better for collaboration.  (I trust myself
to use ASAN and develop with it on, but I don't want to review other people's
code who haven't used ASAN :-P )

Negatively inspired:

- Github source viewer
- tree-sitter-bash, and to some degree seeing semgrep using tree-sitter-bash
- huge amount of Perl code in sloccount
- to some extent, also ctags -- low-level C code
|
## TODO

- `--long-flags` in C++, probably
- Export to parser combinators
- Export to ctags
|
## Algorithm Notes

Two-pass algorithm with StartLine:

First pass:

- Lexer modes with no lookahead or lookbehind
- This is "pre-structuring", as we do in Oils!

Second pass:

- Python - StartLine WS -> Indent/Dedent
- C++ - StartLine MaybePreproc LineCont -> preprocessor
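The Python case of the second pass can be sketched like this: the first pass
records the leading-whitespace width at each StartLine, and the second pass
compares widths against a stack, emitting Indent/Dedent events.  This is the
classic tokenizer technique; the names below are illustrative, not the actual
micro-syntax API:

```python
def indent_dedent(ws_widths):
    """Turn per-line leading-whitespace widths into INDENT/DEDENT events."""
    stack = [0]  # indentation levels currently open
    events = []
    for width in ws_widths:
        if width > stack[-1]:
            stack.append(width)
            events.append('INDENT')
        else:
            while width < stack[-1]:
                stack.pop()
                events.append('DEDENT')
    return events

# def f():        width 0
#     if x:       width 4
#         pass    width 8
# g()             width 0
print(indent_dedent([0, 4, 8, 0]))  # ['INDENT', 'INDENT', 'DEDENT', 'DEDENT']
```

Note this second pass needs no lookahead either; it only consumes the widths
the first pass already produced.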
|
Q: Are here docs first pass or second pass?

TODO:

- C++
  - arbitrary raw strings `R"zZXx(`
- Shell
  - YSH multi-line strings
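Arbitrary raw string delimiters are why the C++ case is tricky for a plain
regular lexer: the closing `)zZXx"` must echo whatever delimiter appeared in
the opening `R"zZXx(`.  A sketch of the capture-then-search approach (the
function name and 16-char delimiter limit follow the C++ standard's rule, but
this is an illustration, not the real lexer):

```python
import re

# C++ raw string opener: R"delim( where delim is 0-16 chars, no parens,
# backslashes, or whitespace.
RAW_START = re.compile(r'R"([^()\\ \t\n]{0,16})\(')

def find_raw_string(s, pos):
    """Return (start, end) of a C++ raw string literal at or after pos."""
    m = RAW_START.search(s, pos)
    if not m:
        return None
    delim = m.group(1)
    end = s.find(')%s"' % delim, m.end())  # closing )delim"
    if end == -1:
        return None  # unterminated
    return m.start(), end + len(delim) + 2

code = 'auto s = R"zZXx(some )text" here)zZXx";'
start, end = find_raw_string(code, 0)
print(code[start:end])  # R"zZXx(some )text" here)zZXx"
```

In re2c terms this likely means dropping to hand-written code at the opener,
since the closing pattern depends on captured text.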
|
Parsing:

- Name tokens should also have contents?
  - at least for Python and C++
- Shell: we want these at the start of a line:
  - `proc X`, `func X`, `f()`
  - not `echo proc X`
- Some kind of parser combinator library to match definitions
  - like showpy, showsh, but you can export to HTML with line numbers, and
    anchor
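The "definitions at start of line" idea above can be sketched with a
line-anchored pattern; anchoring at column 0 is exactly what rules out
`echo proc X`.  The regex and `find_defs` name below are hypothetical, and a
real version would run over the lexer's Name tokens rather than raw text:

```python
import re

# A definition must start its line: `proc X`, `func X`, or `f()`.
# Indented or mid-line occurrences (like `echo proc X`) don't match.
DEF_RE = re.compile(r'^(?:proc|func)\s+(\w+)|^(\w+)\s*\(\)', re.MULTILINE)

def find_defs(src):
    defs = []
    for m in DEF_RE.finditer(src):
        defs.append(m.group(1) or m.group(2))
    return defs

src = '''\
proc myproc {
echo proc X
func myFunc(x) {
f() {
'''
print(find_defs(src))  # ['myproc', 'myFunc', 'f']
```

A parser combinator layer would replace the single regex with composable
rules per language, but the column-0 anchoring idea stays the same.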
|
### Design Question

- Can they be made incremental?
  - Run on every keystroke?  Supposedly IntelliJ does that.
  - <https://www.jetbrains.com/help/resharper/sdk/ImplementingLexers.html#strongly-typed-lexers>
  - But if you reuse Python's lexer, it's probably not incremental
    - see Python's tokenize.py
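One standard answer to the incremental question, as I understand the editor
technique, is to store the lexer mode at the start of every line; after an
edit you re-lex from the changed line and stop as soon as a stored
start-of-line state matches again.  A toy sketch with a two-mode lexer (all
names hypothetical):

```python
def lex_line(line, state):
    """Toy lexer: state is 'normal' or 'comment' (inside a /* */ block)."""
    i = 0
    while i < len(line):
        if state == 'normal' and line[i:i+2] == '/*':
            state, i = 'comment', i + 2
        elif state == 'comment' and line[i:i+2] == '*/':
            state, i = 'normal', i + 2
        else:
            i += 1
    return state  # mode at the start of the NEXT line

def relex(lines, start_states, edited):
    """Re-lex from the edited line; stop once start-of-line states converge.

    Returns the index of the first line that did NOT need re-lexing.
    """
    state = start_states[edited]
    for i in range(edited, len(lines)):
        state = lex_line(lines[i], state)
        if i + 1 < len(start_states):
            if start_states[i + 1] == state:
                return i + 1  # converged; lines below are unaffected
            start_states[i + 1] = state
    return len(lines)

lines = ['int x;', '/* a', 'b */', 'int y;']
states = ['normal', 'normal', 'comment', 'normal']
lines[1] = 'a'  # edit removes the comment opener
print(relex(lines, states, 1))  # 3 -- only lines 1 and 2 were re-lexed
```

This works precisely because the first pass has no lookahead or lookbehind:
a line's lexing depends only on its text and its start state.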
|
| 117 |
|
## Notes

Why not reuse off-the-shelf tools?

1. Because we are a POLYGLOT codebase.
1. Because we care about speed.  (e.g. Github's source viewer is super slow
   now!)
   - And I think we can do a little bit better than `devtools/ctags.sh`.
   - That is, we can generate a better tags file.

We output 2 things:

1. A list of spans
   - type.  TODO: see Vim and textmate types: comment, string, definition
   - location: line, begin:end col
2. A list of "OTAGS"
   - SYMBOL FILENAME LINE
   - generate ctags from this
   - generate HTML or JSON from this
     - recall Woboq code browser was entirely static, in C++
     - they used `compile_commands.json`
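For reference, a tags file is just tab-separated `NAME FILE ADDRESS` lines
sorted by name, and a plain line number is a valid vi-style address, so
generating ctags from OTAGS triples is nearly a one-liner per entry.  A
sketch (the `otags_to_ctags` name is made up for illustration):

```python
def otags_to_ctags(otags):
    """otags: list of (symbol, filename, line) triples -> tags file text."""
    lines = []
    for symbol, filename, line_num in sorted(otags):  # tags must be sorted
        # Third field is a vi-style tag address; a bare line number works.
        lines.append('%s\t%s\t%d' % (symbol, filename, line_num))
    return '\n'.join(lines) + '\n'

otags = [('main', 'core/main.cc', 10), ('Lexer', 'lex.h', 3)]
print(otags_to_ctags(otags), end='')
# Lexer	lex.h	3
# main	core/main.cc	10
```

HTML output is the same traversal with `<a name=...>` anchors per OTAG
instead of tab-separated lines.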
|
- Leaving out VARIABLES, because those are local.
- I think the 'use' lexer is dynamic, sort of like it is in Vim.
- 'Find uses' can be approximated with `grep -n`?  I think that simplifies
  things a lot.
  - It's good practice for code to be greppable.
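The `grep -n` approximation of 'find uses' is a whole-word search that
reports `FILENAME:LINE:text`, like `grep -nw`.  A pure-Python equivalent for
illustration (the `find_uses` name and in-memory file dict are assumptions):

```python
import re

def find_uses(symbol, files):
    """Approximate 'find uses' the way `grep -nw SYMBOL` would.

    files: dict of filename -> file contents.
    """
    pattern = re.compile(r'\b%s\b' % re.escape(symbol))  # whole-word match
    hits = []
    for filename, text in files.items():
        for line_num, line in enumerate(text.splitlines(), 1):
            if pattern.search(line):
                hits.append('%s:%d:%s' % (filename, line_num, line))
    return hits

files = {'a.py': 'x = Mem()\nmemo = 1\n', 'b.py': 'print(Mem)\n'}
print(find_uses('Mem', files))  # ['a.py:1:x = Mem()', 'b.py:1:print(Mem)']
```

The word-boundary anchors are what make `memo` a non-match, mirroring
`grep -w`; it's approximate because it can't distinguish `state.Mem` from an
unrelated `Mem` in another module.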
|
### Languages

Note: All our source code, and generated Python and C++ code, should be
lexable like this.  Put it all in `src-tree.wwz`.

- Shell:
  - comments
  - `'' "" $''` string literals
  - here docs
  - functions
  - understand `{ }` matching?

- YSH
  - strings `j""`
  - multi-line strings `''' """ j"""`
  - proc def
  - func def

- Python
  - `#` comments
  - `"" ''` strings
  - multi-line strings
  - these may require INDENT/DEDENT tokens
  - class
  - def
  - Does it understand `state.Mem`?  Probably.
    - Vim only understands `Mem` though.  We might be able to convince it to.
  - Reference:
  - We may also need a fast whole-file lexer for `var_name` and `package.Var`,
    which does dynamic lookup.

- C++
  - `//` comments
  - `/* */` comments
  - preprocessor `#if #define`
  - multi-line strings in generated code
  - Parsing:
    - `class` declarations, with method declarations
    - function declarations (prototypes)
      - these are a bit hard -- do they require parsing?
    - function and method definitions
      - including templates?

- ASDL
  - `#` comments
  - I guess every single type can have a line number
  - it shouldn't jump to a Python file
  - `value_e.Str` and `value.Str` and `value_t` can jump to the right
    definition

- R: `#` comments and `"\n"` strings
|
### More languages

- JS: `//` and `/* */` comments, and backticks for templates
- CSS: `/* */` comments
  - there are no real symbols to extract here
- YAML: `#` comments and strings
  - there's no parsing, just highlighting
- Markdown
  - the headings would be nice -- other stuff is more complex
  - the `==` and `--` styles require lookahead; they're not line-based
    - so it needs a different model than `ScanOne()`

- spec tests
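The Markdown setext-heading case above shows why: whether a line is a heading
depends on the NEXT line being an `====` or `----` underline, so a strictly
line-at-a-time `ScanOne()` can't decide on its own.  A sketch of the one-line
lookahead (names hypothetical, and real CommonMark has more edge cases):

```python
import re

UNDERLINE = re.compile(r'^(=+|-+)\s*$')

def setext_headings(lines):
    """Find setext headings: a text line followed by an ==== or ---- line."""
    headings = []
    for i in range(len(lines) - 1):  # needs one line of lookahead
        if lines[i].strip() and UNDERLINE.match(lines[i + 1]):
            level = 1 if lines[i + 1].lstrip().startswith('=') else 2
            headings.append((lines[i].strip(), level))
    return headings

doc = ['Micro Syntax', '============', '', 'Related', '-------', 'text']
print(setext_headings(doc))  # [('Micro Syntax', 1), ('Related', 2)]
```

So Markdown wants a scanner that can peek one line ahead, or a second pass
over StartLine tokens like the Python indentation case.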