1 | mycpp
|
2 | =====
|
3 |
|
4 | This is a Python-to-C++ translator based on MyPy. It only
|
5 | handles the small subset of Python that we use in Oils.
|
6 |
|
7 | It's inspired by both mypyc and Shed Skin. These posts give background:
|
8 |
|
9 | - [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
|
10 | - [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)
|
11 |
|
12 | As of March 2024, the translation to C++ is **done**. So it's no longer
|
13 | experimental!
|
14 |
|
15 | However, it's still pretty **hacky**. This doc exists mainly to explain the
|
16 | hacks. (We may want to rewrite mycpp as "yaks", although it's low priority
|
17 | right now.)
|
18 |
|
19 | ---
|
20 |
|
21 | Source for this doc: [mycpp/README.md]($oils-src). The code is all in
|
22 | [mycpp/]($oils-src).
|
23 |
|
24 |
|
25 | <div id="toc">
|
26 | </div>
|
27 |
|
28 | ## Instructions
|
29 |
|
30 | ### Translating and Compiling `oils-cpp`
|
31 |
|
32 | Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
|
33 | instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
|
34 | the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
|
35 | run:
|
36 |
|
37 | oil$ build/py.sh all
|
38 |
|
39 | This will give you a working shell:
|
40 |
|
41 | oil$ bin/osh -c 'echo hi' # running interpreted Python
|
42 | hi
|
43 |
|
44 | To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
|
45 | dependencies. First install packages:
|
46 |
|
47 | # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
|
48 | oil$ build/deps.sh install-ubuntu-packages
|
49 |
|
50 | Then fetch data, like the Python 3.10 tarball and MyPy repo:
|
51 |
|
52 | oil$ build/deps.sh fetch
|
53 |
|
54 | Then build from source:
|
55 |
|
56 | oil$ build/deps.sh install-wedges
|
57 |
|
58 | To build oil-native, use:
|
59 |
|
60 | oil$ ./NINJA-config.sh
|
61 | oil$ ninja # translate and compile, may take 30 seconds
|
62 |
|
63 | oil$ _bin/cxx-asan/osh -c 'echo hi' # running compiled C++ !
|
64 | hi
|
65 |
|
66 | To run the tests and benchmarks:
|
67 |
|
68 | oil$ mycpp/TEST.sh test-translator
|
69 | ... 200+ tasks run ...
|
70 |
|
71 | If you have problems, post a message on `#oil-dev` at
|
72 | `https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
|
73 | so I can use your feedback!
|
74 |
|
75 | Related:
|
76 |
|
77 | - [Oil Native Quick
|
78 | Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start) on the
|
79 | wiki.
|
80 | - [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Dev-Cheat-Sheet)
|
81 |
|
82 | ## Notes on the Algorithm / Architecture
|
83 |
|
84 | There are four passes over the MyPy AST.
|
85 |
|
86 | (1) `const_pass.py`: Collect string constants
|
87 |
|
88 | Turn the constant in `myfunc("foo")` into top-level `GLOBAL_STR(str1,
|
89 | "foo")`.
|
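As a rough illustration of what a constant-collection pass does, here is a minimal sketch. It uses Python 3's standard `ast` module rather than the MyPy AST that `const_pass.py` walks, and the helper names `collect_string_constants` and `emit_globals` are hypothetical. It also ignores string escaping.

```python
import ast


def collect_string_constants(source):
    """Map each distinct string literal to a generated name (str0, str1, ...)."""
    names = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value not in names:
                names[node.value] = 'str%d' % len(names)
    return names


def emit_globals(names):
    """Emit one GLOBAL_STR() line per collected constant (no escaping)."""
    return ['GLOBAL_STR(%s, "%s")' % (name, value)
            for value, name in names.items()]


consts = collect_string_constants('myfunc("foo")\nother("bar", "foo")')
lines = emit_globals(consts)
```

Repeated literals share one global, which is why `"foo"` appears only once in `consts`.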
90 |
|
91 | (2) Three passes in `cppgen_pass.py`.
|
92 |
|
93 | (a) Forward Declaration Pass.
|
94 |
|
95 | class Foo;
|
96 | class Bar;
|
97 |
|
98 | This pass also determines which methods should be declared `virtual` in their
|
99 | declarations. The `virtual` keyword is written in the next pass.
|
100 |
|
101 | (b) Declaration Pass.
|
102 |
|
103 | class Foo {
|
104 | void method();
|
105 | };
|
106 | class Bar {
|
107 | void method();
|
108 | };
|
109 |
|
110 | More work in this pass:
|
111 |
|
112 | - Collect member variables and write them at the end of the definition
|
113 | - Collect locals for "hoisting". Written in the next pass.
|
114 | - Create `fmtN()` functions to compile Python's `%` formatting operator.
|
115 |
|
116 | (c) Definition Pass.
|
117 |
|
118 |     void Foo::method() {
|
119 |       ...
|
120 |     }
|
121 |
|
122 |     void Bar::method() {
|
123 |       ...
|
124 |     }
|
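The reason for three passes is ordering: C++ needs names declared before use, while Python does not. A toy sketch of the same idea, assuming a made-up input format (a dict of class name to method names) and a hypothetical `translate()` function that is much simpler than `cppgen_pass.py`:

```python
def translate(classes):
    """classes: dict of class name -> list of method names (void, no args)."""
    out = []
    # Pass (a): forward declarations, so classes can refer to each other
    for name in classes:
        out.append('class %s;' % name)
    # Pass (b): declarations with method prototypes
    for name, methods in classes.items():
        out.append('class %s {' % name)
        out.append(' public:')
        for m in methods:
            out.append('  void %s();' % m)
        out.append('};')
    # Pass (c): definitions with (empty) bodies
    for name, methods in classes.items():
        for m in methods:
            out.append('void %s::%s() {' % (name, m))
            out.append('}')
    return out


cpp_lines = translate({'Foo': ['method'], 'Bar': ['method']})
```

Each pass walks the same input again and emits a different slice of the output.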
125 |
|
126 | Note: I really wish we were not using visitors, but that's inherited from MyPy.
|
127 |
|
128 | ## mycpp Idioms / "Creative Hacks"
|
129 |
|
130 | Oils is written in typed Python 2. It will run under a stock Python 2
|
131 | interpreter, and it will typecheck with stock MyPy.
|
132 |
|
133 | However, there are a few language features that don't map cleanly from typed
|
134 | Python to C++:
|
135 |
|
136 | - switch statements (unfortunately we don't have the Python 3 match statement)
|
137 | - C++ destructors - the RAII pattern
|
138 | - casting - MyPy has one kind of cast; C++ has `static_cast` and
|
139 | `reinterpret_cast`. (We don't use C-style casting.)
|
140 |
|
141 | So this describes the idioms we use. There are some hacks in
|
142 | [mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
|
143 | runtime equivalents in `mycpp/mylib.py`.
|
144 |
|
145 | ### `with {,tag,str_}switch` → Switch statement
|
146 |
|
147 | We have three constructs that translate to a C++ switch statement. They use a
|
148 | Python context manager `with Xswitch(obj) ...` as a little hack.
|
149 |
|
150 | Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
|
151 | (`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)
|
152 |
|
153 | Simple switch:
|
154 |
|
155 |     myint = 99
|
156 |     with switch(myint) as case:
|
157 |         if case(42, 43):
|
158 |             print('forties')
|
159 |         else:
|
160 |             print('other')
|
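For this to run under a stock Python interpreter, `switch` has to be a real context manager. The following is a minimal sketch of how such a shim could look. It is an assumption for illustration and not a copy of the code in `mycpp/mylib.py`; `classify()` is a hypothetical caller.

```python
class switch(object):
    """Hypothetical Python-side shim so `with switch(x) as case:` runs."""

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False  # never swallow exceptions

    def __call__(self, *cases):
        # case(42, 43) is true if the switched value equals any argument
        return self.value in cases


def classify(myint):
    with switch(myint) as case:
        if case(42, 43):
            return 'forties'
        else:
            return 'other'
```

In Python the `if` / `elif` chain is evaluated in order at runtime, while the translator can turn the same shape into `case` labels.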
161 |
|
162 | Switch on **object type**, which goes well with ASDL sum types:
|
163 |
|
164 |     val = value.Str('foo')  # type: value_t
|
165 |     with tagswitch(val) as case:
|
166 |         if case(value_e.Str, value_e.Int):
|
167 |             print('string or int')
|
168 |         else:
|
169 |             print('other')
|
170 |
|
171 | We usually need to apply the `UP_val` pattern here, described in the next
|
172 | section.
|
173 |
|
174 | Switch on **string**, which generates a fast **two-level dispatch** -- first on
|
175 | length, and then with `str_equals_c()`:
|
176 |
|
177 |     s = 'foo'
|
178 |     with str_switch(s) as case:
|
179 |         if case("foo"):
|
180 |             print('FOO')
|
181 |         else:
|
182 |             print('other')
|
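The two-level dispatch can be pictured in plain Python as follows. This is only a sketch of the shape of the generated code: the real output is C++ that compares with `str_equals_c()`, and the extra cases `'bar'` and `'spam'` are made up for illustration.

```python
def dispatch(s):
    """Two-level string dispatch: length first, then exact comparison."""
    n = len(s)
    if n == 3:
        # Only candidates of length 3 are compared
        if s == 'foo':
            return 'FOO'
        if s == 'bar':
            return 'BAR'
    elif n == 4:
        if s == 'spam':
            return 'SPAM'
    return 'other'
```

Most non-matching inputs are rejected by the length test alone, without any byte comparison.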
183 |
|
184 | ### `val` → `UP_val` → `val` Downcasting pattern
|
185 |
|
186 | Summary: variable names like `UP_*` are **special** in our Python code.
|
187 |
|
188 | Consider the downcasts marked BAD:
|
189 |
|
190 |     val = value.Str('foo')  # type: value_t
|
191 |
|
192 |     with tagswitch(val) as case:
|
193 |         if case(value_e.Str):
|
194 |             val = cast(value.Str, val)  # BAD: conflicts with first declaration
|
195 |             print('s = %s' % val.s)
|
196 |
|
197 |         elif case(value_e.Int):
|
198 |             val = cast(value.Int, val)  # BAD: conflicts with both
|
199 |             print('i = %d' % val.i)
|
200 |
|
201 |         else:
|
202 |             print('other')
|
203 |
|
204 | MyPy allows this, but it translates to invalid C++ code. C++ can't have a
|
205 | variable named `val`, with 2 related types `value_t` and `value::Str`.
|
206 |
|
207 | So we use this idiom instead, which takes advantage of **local vars in case
|
208 | blocks** in C++:
|
209 |
|
210 |     val = value.Str('foo')  # type: value_t
|
211 |
|
212 |     UP_val = val  # temporary variable that will be cast
|
213 |
|
214 |     with tagswitch(val) as case:
|
215 |         if case(value_e.Str):
|
216 |             val = cast(value.Str, UP_val)  # this works
|
217 |             print('s = %s' % val.s)
|
218 |
|
219 |         elif case(value_e.Int):
|
220 |             val = cast(value.Int, UP_val)  # also works
|
221 |             print('i = %d' % val.i)
|
222 |
|
223 |         else:
|
224 |             print('other')
|
225 |
|
226 | This translates to something like:
|
227 |
|
228 |     value_t* val = Alloc<value::Str>(str42);
|
229 |     value_t* UP_val = val;
|
230 |
|
231 |     switch (val->tag()) {
|
232 |       case value_e::Str: {
|
233 |         // DIFFERENT local var
|
234 |         value::Str* val = static_cast<value::Str*>(UP_val);
|
235 |         print(StrFormat(str43, val->s));
|
236 |       }
|
237 |         break;
|
238 |       case value_e::Int: {
|
239 |         // ANOTHER DIFFERENT local var
|
240 |         value::Int* val = static_cast<value::Int*>(UP_val);
|
241 |         print(StrFormat(str44, val->i));
|
242 |       }
|
243 |         break;
|
244 |       default:
|
245 |         print(str45);
|
246 |     }
|
247 |
|
248 | This works because there's no problem having **different** variables with the
|
249 | same name within each `case { }` block.
|
250 |
|
251 | Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
|
252 | the inner blocks will look like:
|
253 |
|
254 |     case value_e::Str: {
|
255 |       val = static_cast<value::Str*>(val);  // BAD: val reused
|
256 |       print(StrFormat(str43, val->s));
|
257 |     }
|
258 |
|
259 | And they will fail to compile. It's not valid C++ because the superclass
|
260 | `value_t` doesn't have a field `s`, so `val->s` is an error. Only the subclass `value::Str` has
|
261 | it.
|
262 |
|
263 | (Note that Python has a single flat scope per function, while C++ has nested
|
264 | scopes.)
|
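A small plain-Python illustration of the flat function scope (the function name `flat_scope` is made up):

```python
def flat_scope(flag):
    if flag:
        x = 'set in if-block'    # binds x in the FUNCTION's scope
    else:
        x = 'set in else-block'  # same variable x, not a new one
    # x is still visible here, because Python has one scope per function.
    # In C++, a variable declared inside a { } block ends with that block.
    return x
```

This is why rebinding `val` in each branch means "one variable" to Python, while the translated C++ needs a separate declaration per block.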
265 |
|
266 | ### Python context manager → C++ constructor and destructor (RAII)
|
267 |
|
268 | This Python code:
|
269 |
|
270 | with ctx_Foo(42):
|
271 | f()
|
272 |
|
273 | translates to this C++ code:
|
274 |
|
275 |     {
|
276 |       ctx_Foo tmp(42);
|
277 |       f();
|
278 |
|
279 |       // destructor ~ctx_Foo implicitly called
|
280 |     }
|
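On the Python side, a context manager that fits this translation could look like the sketch below. It assumes the setup work lives in `__init__` (playing the role of the constructor) and the cleanup in `__exit__` (playing the role of the destructor). The `log` list and the body of `f()` are made up to make the order of events visible.

```python
log = []


class ctx_Foo(object):
    """Hypothetical context manager; __init__ acts like the C++ constructor."""

    def __init__(self, n):
        self.n = n
        log.append('enter %d' % n)

    def __enter__(self):
        return None  # nothing to do; the work happened in __init__

    def __exit__(self, exc_type, exc_value, traceback):
        # Plays the role of the destructor ~ctx_Foo
        log.append('exit %d' % self.n)
        return False  # let exceptions propagate


def f():
    log.append('f')


with ctx_Foo(42):
    f()
```

After the `with` block, `log` records setup, body, and cleanup in that order, which mirrors the lifetime of `tmp` in the C++ block.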
281 |
|
282 | ## MyPy "Shimming" Technique
|
283 |
|
284 | We have an interesting way of "writing Python and C++ at the same time":
|
285 |
|
286 | 1. First, all Python code must pass the MyPy type checker, and run with a stock
|
287 | Python 2 interpreter.
|
288 | - This is the source of truth — the source of our semantics.
|
289 | 1. We translate most `.py` files to C++, **except** some files, in particular
|
290 | [mycpp/mylib.py]($oils-src) and files starting with `py` like
|
291 | `core/{pyos,pyutil}.py`.
|
292 | 1. In C++, we can substitute custom implementations with the properties we
|
293 | want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
|
294 | `BufWriter` being efficient, etc.
|
295 |
|
296 | The MyPy type system is very powerful! It lets us do all this.
|
297 |
|
298 | ### NewDict() for ordered dicts
|
299 |
|
300 | Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
|
301 | using `mylib.NewDict()`, which returns `collections_.OrderedDict`.
|
302 |
|
303 | The **static type** is still `Dict[K, V]`, but we change the "spec" to be an
|
304 | ordered dict.
|
305 |
|
306 | In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
|
307 | implement preserving order on deletion, which seems OK.)
|
308 |
|
309 | - TODO: `iteritems()` could go away
|
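A minimal sketch of the Python side, assuming the standard library's `collections.OrderedDict` as a stand-in for `collections_.OrderedDict`:

```python
from collections import OrderedDict


def NewDict():
    """Sketch of mylib.NewDict(): returns an ordered dict at runtime.

    Assumption: stdlib OrderedDict stands in for collections_.OrderedDict.
    """
    return OrderedDict()


d = NewDict()
d['z'] = 1
d['a'] = 2
d['m'] = 3
keys = list(d)  # iteration follows insertion order
```

Callers still annotate the result as `Dict[K, V]`; only the runtime behavior changes.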
310 |
|
311 | ### StackArray[T]
|
312 |
|
313 | TODO: describe this when it works.
|
314 |
|
315 | ### BigInt
|
316 |
|
317 | - In Python, it's simply defined as a class with an integer, in
|
318 | [mycpp/mops.py]($oils-src).
|
319 | - In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
|
320 | integer.
|
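The idea of "a class with an integer" can be sketched as follows. The function names `Add` and `Equal` are assumptions for illustration and may not match the actual module.

```python
class BigInt(object):
    """Hypothetical wrapper: a distinct static type around a Python integer."""

    def __init__(self, i):
        self.i = i


def Add(a, b):
    # Arithmetic goes through functions, so the type checker can reject
    # code that mixes BigInt with plain int
    return BigInt(a.i + b.i)


def Equal(a, b):
    return a.i == b.i


total = Add(BigInt(2 ** 62), BigInt(2 ** 62))
```

Because the wrapper is a separate type, MyPy flags `BigInt + int` mistakes before the code ever reaches C++.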
321 |
|
322 | ### ByteAt(), ByteEquals(), ...
|
323 |
|
324 | Hand optimization to avoid allocating 1-byte strings. Used in the IFS algorithm,
|
325 | `LooksLikeGlob()`, and `GlobUnescape()`.
|
326 |
|
327 | ### File / LineReader / BufWriter
|
328 |
|
329 | TODO: describe how this works.
|
330 |
|
331 | Can it be more type safe? I think we can cast `File` to both `LineReader` and
|
332 | `BufWriter`.
|
333 |
|
334 | Or can we invert the relationship, so `File` derives from **both** LineReader
|
335 | and BufWriter?
|
336 |
|
337 | ### Fast JSON - avoid intermediate allocations
|
338 |
|
339 | - `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
|
340 | only to throw them away and write to `mylib.BufWriter`. Instead, we append
|
341 | encoded strings **directly** to the `BufWriter`.
|
342 | - Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
|
343 | when writing indents.
|
344 | - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
|
345 | - We may also want `BufWriter::write_slice()`
|
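To show the calling pattern, here is a small Python stand-in. The class below is hypothetical and only mimics the interface; the allocation savings described above come from the C++ implementation, not from this sketch.

```python
class BufWriter(object):
    """Hypothetical stand-in for mylib.BufWriter: collects chunks, joins once."""

    def __init__(self):
        self.parts = []

    def write(self, s):
        self.parts.append(s)

    def write_spaces(self, n):
        # This Python sketch still builds a string of spaces; the C++
        # version is what can write them without a temporary object
        if n > 0:
            self.parts.append(' ' * n)

    def getvalue(self):
        return ''.join(self.parts)


def write_indented(buf, indent, line):
    buf.write_spaces(indent)
    buf.write(line)
    buf.write('\n')


buf = BufWriter()
write_indented(buf, 2, '"k": 1')
```

The caller never constructs the indent string itself, so the C++ side is free to choose how the spaces reach the buffer.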
346 |
|
347 | ## Limitations Requiring Source Rewrites
|
348 |
|
349 | mycpp itself may cause limitations on expressiveness, or the C++ language may
|
350 | not be able to express what we want.
|
351 |
|
352 | - C++ doesn't have `try / except / else`, or `finally`
|
353 | - Use the `with ctx_Foo` pattern instead.
|
354 | - `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
|
355 | non-empty test
|
356 | - Functions can have at most one keyword / optional argument.
|
357 | - We generate two methods: `f(x)` which calls `f(x, y)` with the default
|
358 | value of `y`
|
359 | - If there are two or more optional arguments:
|
360 | - For classes, you can use the "builder pattern", i.e. add an
|
361 | `Init_MyMember()` method
|
362 | - If the arguments are booleans, translate it to a single bitfield argument
|
363 | - C++ has nested scope and Python has flat function scope. This can cause name
|
364 | collisions.
|
365 | - Could enforce this if it becomes a problem
|
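Two of the rewrites above, sketched in plain Python. The flag names `OPT_VERBOSE` and `OPT_DRY_RUN` and the functions `run()` and `describe()` are hypothetical.

```python
# Hypothetical flag values packed into one int "bitfield" argument
OPT_VERBOSE = 1 << 0
OPT_DRY_RUN = 1 << 1


def run(cmd, flags=0):
    """One optional argument instead of two optional booleans."""
    parts = []
    if flags & OPT_VERBOSE:
        parts.append('verbose')
    if flags & OPT_DRY_RUN:
        parts.append('dry-run')
    parts.append(cmd)
    return ' '.join(parts)


def describe(mylist):
    # Explicit length test; `if mylist` would only test for non-NULL in C++
    if len(mylist):
        return 'non-empty'
    return 'empty'
```

Both forms behave the same under the Python interpreter and in the translated C++, which is the point of the rewrite.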
366 |
|
367 | Also see `mycpp/examples/invalid_*` for Python code that fails to translate.
|
368 |
|
369 | ## WARNING: Assumptions Not Checked
|
370 |
|
371 | ### Global Constants Can't Be Mutated
|
372 |
|
373 | We translate top level constants to statically initialized C data structures
|
374 | (zero startup cost):
|
375 |
|
376 | gStr = 'foo'
|
377 | gList = [1, 2] # type: List[int]
|
378 | gDict = {'bar': 42} # type: Dict[str, int]
|
379 |
|
380 | Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
|
381 | these global instances! The C++ code will break at runtime.
|
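If a function needs a modified version of a global, one option is to build a fresh local list instead. This sketch is plain Python, and whether `list(gList)` is accepted by the translator is an assumption to verify; the function `with_extra` is made up.

```python
gList = [1, 2]


def with_extra(x):
    result = list(gList)  # copy; the global itself is never mutated
    result.append(x)
    return result


extended = with_extra(3)
```

The global keeps its original contents, so it stays consistent with the statically initialized C++ data.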
382 |
|
383 | ### Gotcha about Returning Variants (Subclasses) of a Type
|
384 |
|
385 | MyPy will accept this code:
|
386 |
|
387 | ```
|
388 | if cond:
|
389 |     sig = proc_sig.Open  # type: proc_sig_t
|
390 |     # bad because mycpp HOISTS this
|
391 | else:
|
392 |     sig = proc_sig.Closed.CreateNull()
|
393 |     sig.words = words  # assignment fails
|
394 | return sig
|
395 | ```
|
396 |
|
397 | It will translate to C++, but fail to compile. Instead, rewrite it like this:
|
398 |
|
399 | ```
|
400 | sig = None  # type: proc_sig_t
|
401 | if cond:
|
402 |     sig = proc_sig.Open
|
403 |     # OK: `sig` was declared above with the base type
|
404 | else:
|
405 |     closed = proc_sig.Closed.CreateNull()
|
406 |     closed.words = words  # OK: `closed` has the subclass type
|
407 |     sig = closed
|
408 | return sig
|
409 | ```
|
410 |
|
411 | ### Exceptions Can't Leave Destructors / Python `__exit__`
|
412 |
|
413 | Context managers like `with ctx_Foo():` translate to C++ constructors and
|
414 | destructors.
|
415 |
|
416 | In C++, an exception can't "leave" a destructor. It results in a runtime error.
|
417 |
|
418 | You can throw and CATCH an exception WITHIN a destructor, but you can't let it
|
419 | propagate outside.
|
420 |
|
421 | This means you must be careful when coding the `__exit__` method. For example,
|
422 | in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
|
423 | caught when restoring/popping redirects.
|
424 |
|
425 | To fix the bug, we rewrote the code to use an out param
|
426 | `List[IOError_OSError]`.
|
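The out-param pattern can be sketched like this. The class `ctx_Restore` and its `_Pop()` method are hypothetical, and the failure is simulated; this is not the actual `vm::ctx_Redirect` code.

```python
class ctx_Restore(object):
    """Hypothetical context manager whose cleanup can fail."""

    def __init__(self, err_out):
        self.err_out = err_out  # out param: list that receives errors

    def __enter__(self):
        return None

    def _Pop(self):
        raise IOError('restore failed')  # simulated failure

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            self._Pop()
        except IOError as e:
            # Caught WITHIN __exit__, so nothing leaves the "destructor"
            self.err_out.append(e)
        return False


errors = []
with ctx_Restore(errors):
    pass
```

The caller inspects `errors` after the block and decides how to report the failure.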
427 |
|
428 | Related:
|
429 |
|
430 | - <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>
|
431 |
|
432 | ## More Translation Notes
|
433 |
|
434 | ### Hacky Heuristics
|
435 |
|
436 | - `callable(arg)` to either:
|
437 | - function call `f(arg)`
|
438 | - instantiation `Alloc<T>(arg)`
|
439 | - `name.attr` to either:
|
440 | - `obj->member`
|
441 | - `module::Func`
|
442 | - `cast(MyType, obj)` to either:
|
443 | - `static_cast<MyType*>(obj)`
|
444 | - `reinterpret_cast<MyType*>(obj)`
|
445 |
|
446 | ### Hacky Hard-Coded Names
|
447 |
|
448 | These are signs of coupling between mycpp and Oils, which ideally shouldn't
|
449 | exist.
|
450 |
|
451 | - `mycpp_main.py`
|
452 | - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
|
453 | runtime.
|
454 | - TODO: Pea can respect parameter order? So we do that outside the project?
|
455 | - Another ordering constraint comes from **inheritance**. The forward
|
456 | declaration is NOT sufficient in that case.
|
457 | - `cppgen_pass.py`
|
458 | - `_GetCastKind()` has some hard-coded names
|
459 | - `AsdlType::Create()` is special cased to `::`, not `->`
|
460 | - Default arguments e.g. `scope_e::Local` need a repeated `using`.
|
461 |
|
462 | Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>
|
463 |
|
464 | ### Major Features
|
465 |
|
466 | - Python `int` and `bool` → C++ `int` and `bool`
|
467 | - `None` → `nullptr`
|
468 | - Statically Typed Python Collections
|
469 | - `str` → `Str*`
|
470 | - `List[T]` → `List<T>*`
|
471 | - `Dict[K, V]` → `Dict<K, V>*`
|
472 | - tuples → `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
|
473 | - Collection literals turn into initializer lists
|
474 | - And there is a C++ type inference issue which requires an explicit
|
475 | `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
|
476 | - Python's polymorphic iteration → `StrIter`, `ListIter<T>`, `DictIter<K,
|
477 | V>`
|
478 | - `d.iteritems()` is rewritten as `mylib.iteritems()` → `DictIter`
|
479 | - TODO: can we be smarter about this?
|
480 | - `reversed(mylist)` → `ReverseListIter`
|
481 | - Python's `in` operator:
|
482 | - `s in mystr` → `str_contains(mystr, s)`
|
483 | - `x in mylist` → `list_contains(mylist, x)`
|
484 | - Classes and inheritance
|
485 | - `__init__` method becomes a constructor. Note: initializer lists aren't
|
486 | used.
|
487 | - Detect `virtual` methods
|
488 | - TODO: could we detect `abstract` methods? (`NotImplementedError`)
|
489 | - Python generators `Iterator[T]` → eager `List<T>` accumulators
|
490 | - Python Exceptions → C++ exceptions
|
491 | - Python Modules → C++ namespace (we assume a 2-level hierarchy)
|
492 | - TODO: mycpp needs real modules, because our `oils_for_unix.mycpp.cc`
|
493 | translation unit is getting big.
|
494 | - And `cpp/preamble.h` is a hack to work around the lack of modules.
|
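As one example from the list above, the generator translation can be pictured in Python terms. This is a sketch of the general shape only; the function names are made up, and the exact code mycpp generates may differ.

```python
def evens_lazy(n):
    # Python generator: yields values one at a time
    for i in range(n):
        if i % 2 == 0:
            yield i


def evens_eager(n, out):
    # Rough shape of the translation: append to a list instead of yielding
    for i in range(n):
        if i % 2 == 0:
            out.append(i)


acc = []
evens_eager(6, acc)
```

The eager form produces the same values, but all of them are computed before the caller sees the first one.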
495 |
|
496 | ### Minor Translations
|
497 |
|
498 | - `s1 == s2` → `str_equals(s1, s2)`
|
499 | - `'x' * 3` → `str_repeat(globalStr, 3)`
|
500 | - `[None] * 3` → `list_repeat(nullptr, 3)`
|
501 | - Omitted:
|
502 | - If the LHS of an assignment is `_`, then the statement is omitted
|
503 | - This is for `_ = log`, which shuts up Python lint warnings for 'unused
|
504 | import'
|
505 | - Code under `if __name__ == '__main__'`
|
506 |
|
507 | ### Optimizations
|
508 |
|
509 | - Returning Tuples by value. To reduce GC pressure, we return
|
510 | `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and `Tuple4`.
|
511 |
|
512 | ### Rooting Policy
|
513 |
|
514 | The translated code roots local variables in every function:
|
515 |
|
516 | StackRoots _r({&var1, &var2});
|
517 |
|
518 | We have two kinds of hand-written code:
|
519 |
|
520 | 1. Methods like `Str::strip()` in `mycpp/`
|
521 | 2. OS bindings like `stat()` in `cpp/`
|
522 |
|
523 | Neither of them needs any rooting! This is because we use **manual collection
|
524 | points** in the interpreter, and these functions don't call any functions that
|
525 | can collect. They are "leaves" in the call tree.
|
526 |
|
527 | ## The mycpp Runtime
|
528 |
|
529 | The mycpp translator targets a runtime that's written from scratch. It
|
530 | implements garbage-collected data structures like:
|
531 |
|
532 | - Typed records
|
533 | - Python classes
|
534 | - ASDL product and sum types
|
535 | - `Str` (immutable, as in Python)
|
536 | - `List<T>`
|
537 | - `Dict<K, V>`
|
538 | - `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...
|
539 |
|
540 | It also has functions based on CPython's:
|
541 |
|
542 | - `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
|
543 | module, e.g. `int()` and `str()`
|
544 | - `mycpp/gc_mylib.{h,cc}` corresponds to `mylib.py`
|
545 | - `mylib.BufWriter` is a bit like `cStringIO.StringIO`
|
546 |
|
547 | ### Differences from CPython
|
548 |
|
549 | - Integers are either C `int` or `mylib.BigInt`, not Python's arbitrary size
|
550 | integers
|
551 | - `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
|
552 | CPython
|
553 | - `s.strip()` is defined in terms of ASCII whitespace, which does not include
|
554 | say `\v`.
|
555 | - This is done to be consistent with JSON and J8 Notation.
|
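The `strip()` difference can be reproduced in Python 3. The whitespace set below is an assumption (the four characters JSON treats as whitespace), and `ascii_strip` is a hypothetical helper.

```python
ASCII_WS = ' \t\r\n'  # assumption: the JSON whitespace set


def ascii_strip(s):
    """Strip only the whitespace characters listed in ASCII_WS."""
    return s.strip(ASCII_WS)


sample = '\v foo \n'
runtime_like = ascii_strip(sample)  # leading \v is kept
cpython_like = sample.strip()       # CPython also strips \v
```

Code that relies on `strip()` removing vertical tabs or form feeds will behave differently after translation.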
556 |
|
557 | ## C++ Notes
|
558 |
|
559 | ### Gotchas
|
560 |
|
561 | - C++ classes can have 2 member variables of the same name! From the base
|
562 | class and derived class.
|
563 | - Failing to declare methods `virtual` can result in the wrong one being called
|
564 | at runtime
|
565 |
|
566 | ### Minor Features Used
|
567 |
|
568 | In addition to classes, templates, exceptions, etc. mentioned above, we use:
|
569 |
|
570 | - `static_cast` and `reinterpret_cast`
|
571 | - `enum class` for ASDL
|
572 | - Function overloading
|
573 | - For equality and hashing?
|
574 | - `offsetof` for introspection of field positions for garbage collection
|
575 | - `std::initializer_list` for `StackRoots()`
|
576 | - Should we get rid of this?
|
577 |
|
578 | ### Not Used
|
579 |
|
580 | - I/O Streams, RTTI, etc.
|
581 | - `const`
|
582 | - Smart pointers
|
583 |
|