mycpp
=====

This is a Python-to-C++ translator based on MyPy. It only
handles the small subset of Python that we use in Oils.

It's inspired by both mypyc and Shed Skin. These posts give background:

- [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
- [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)

As of March 2024, the translation to C++ is **done**. So it's no longer
experimental!

However, it's still pretty **hacky**. This doc exists mainly to explain the
hacks. (We may want to rewrite mycpp as "yaks", although it's low priority
right now.)

---

Source for this doc: [mycpp/README.md]($oils-src). The code is all in
[mycpp/]($oils-src).


<div id="toc">
</div>

## Instructions

### Translating and Compiling `oils-cpp`

Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
run:

    oil$ build/py.sh all

This will give you a working shell:

    oil$ bin/osh -c 'echo hi'  # running interpreted Python
    hi

To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
dependencies. First install packages:

    # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
    oil$ build/deps.sh install-ubuntu-packages

Then fetch data, like the Python 3.10 tarball and MyPy repo:

    oil$ build/deps.sh fetch

Then build from source:

    oil$ build/deps.sh install-wedges

To build oil-native, use:

    oil$ ./NINJA-config.sh
    oil$ ninja  # translate and compile, may take 30 seconds

    oil$ _bin/cxx-asan/osh -c 'echo hi'  # running compiled C++ !
    hi

To run the tests and benchmarks:

    oil$ mycpp/TEST.sh test-translator
    ... 200+ tasks run ...

If you have problems, post a message on `#oil-dev` at
`https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
so I can use your feedback!

Related:

- [Oil Native Quick Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start)
  on the wiki.
- [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Dev-Cheat-Sheet)

## Notes on the Algorithm / Architecture

There are four passes over the MyPy AST.

(1) `const_pass.py`: Collect string constants

Turn the constant in `myfunc("foo")` into top-level `GLOBAL_STR(str1,
"foo")`.

(2) Three passes in `cppgen_pass.py`.

(a) Forward Declaration Pass.

    class Foo;
    class Bar;

This pass also determines which methods should be declared `virtual` in their
declarations. The `virtual` keyword is written in the next pass.

(b) Declaration Pass.

    class Foo {
      void method();
    };
    class Bar {
      void method();
    };

More work in this pass:

- Collect member variables and write them at the end of the definition
- Collect locals for "hoisting". Written in the next pass.
- Create `fmtN()` functions to compile Python's `%` formatting operator.

(c) Definition Pass.

    void Foo::method() {
      ...
    }

    void Bar::method() {
      ...
    }

Note: I really wish we were not using visitors, but that's inherited from MyPy.

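To make the first pass concrete, here is a minimal sketch of collecting string constants. It is only an illustration: it walks the Python 3 standard library `ast` rather than the MyPy AST that mycpp uses, and the function names `CollectStrings` and `EmitGlobals` are made up for this example.

```python
import ast


def CollectStrings(source):
    # type: (str) -> dict
    """Map each distinct string literal to a generated name like str1."""
    names = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value not in names:
                names[node.value] = 'str%d' % (len(names) + 1)
    return names


def EmitGlobals(names):
    # type: (dict) -> list
    """Render one top-level declaration per collected string."""
    # Sort by the numeric suffix so the output order is deterministic.
    pairs = sorted(names.items(), key=lambda kv: int(kv[1][3:]))
    # Assumption: no escaping is done here; real code must escape quotes etc.
    return ['GLOBAL_STR(%s, "%s")' % (name, s) for s, name in pairs]
```

For example, `EmitGlobals(CollectStrings('myfunc("foo")'))` yields a single `GLOBAL_STR(str1, "foo")` line. Repeated literals share one name, which is the point of collecting them up front.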
## mycpp Idioms / "Creative Hacks"

Oils is written in typed Python 2. It will run under a stock Python 2
interpreter, and it will typecheck with stock MyPy.

However, there are a few language features that don't map cleanly from typed
Python to C++:

- switch statements (unfortunately we don't have the Python 3 match statement)
- C++ destructors - the RAII pattern
- casting - MyPy has one kind of cast; C++ has `static_cast` and
  `reinterpret_cast`. (We don't use C-style casting.)

This section describes the idioms we use. There are some hacks in
[mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
runtime equivalents in `mycpp/mylib.py`.

### `with {,tag,str_}switch` &rarr; Switch statement

We have three constructs that translate to a C++ switch statement. They use a
Python context manager `with Xswitch(obj) ...` as a little hack.

Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
(`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)

Simple switch:

    myint = 99
    with switch(myint) as case:
      if case(42, 43):
        print('forties')
      else:
        print('other')

Switch on **object type**, which goes well with ASDL sum types:

    val = value.Str('foo')  # type: value_t
    with tagswitch(val) as case:
      if case(value_e.Str, value_e.Int):
        print('string or int')
      else:
        print('other')

We usually need to apply the `UP_val` pattern here, described in the next
section.

Switch on **string**, which generates a fast **two-level dispatch** -- first on
length, and then with `str_equals_c()`:

    s = 'foo'
    with str_switch(s) as case:
      if case("foo"):
        print('FOO')
      else:
        print('other')

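At Python runtime, `with switch(x) as case:` only needs a small context manager whose result is callable. The following is a minimal sketch of that idea, and not necessarily the code in `mycpp/mylib.py`. The class body and the `classify` helper are assumptions made for illustration.

```python
class switch(object):
    """Sketch: makes `with switch(x) as case:` work in plain Python."""

    def __init__(self, value):
        # type: (int) -> None
        self.value = value

    def __enter__(self):
        # type: () -> switch
        return self  # bound to `case` by the `as` clause

    def __exit__(self, exc_type, exc_value, traceback):
        # type: (object, object, object) -> bool
        return False  # never swallow exceptions

    def __call__(self, *cases):
        # type: (*int) -> bool
        return self.value in cases  # true if any listed value matches


def classify(myint):
    # type: (int) -> str
    with switch(myint) as case:
        if case(42, 43):
            return 'forties'
        else:
            return 'other'
```

In Python this is an ordinary `if` / `else` chain. Only the translator gives the `with` block its switch-statement meaning.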
### `val` &rarr; `UP_val` &rarr; `val` Downcasting pattern

Summary: variable names like `UP_*` are **special** in our Python code.

Consider the downcasts marked BAD:

    val = value.Str('foo')  # type: value_t

    with tagswitch(val) as case:
      if case(value_e.Str):
        val = cast(value.Str, val)  # BAD: conflicts with first declaration
        print('s = %s' % val.s)

      elif case(value_e.Int):
        val = cast(value.Int, val)  # BAD: conflicts with both
        print('i = %d' % val.i)

      else:
        print('other')

MyPy allows this, but it translates to invalid C++ code. C++ can't have a
variable named `val`, with 2 related types `value_t` and `value::Str`.

So we use this idiom instead, which takes advantage of **local vars in case
blocks** in C++:

    val = value.Str('foo')  # type: value_t

    UP_val = val  # temporary variable that will be cast

    with tagswitch(val) as case:
      if case(value_e.Str):
        val = cast(value.Str, UP_val)  # this works
        print('s = %s' % val.s)

      elif case(value_e.Int):
        val = cast(value.Int, UP_val)  # also works
        print('i = %d' % val.i)

      else:
        print('other')

This translates to something like:

    value_t* val = Alloc<value::Str>(str42);
    value_t* UP_val = val;

    switch (val->tag()) {
      case value_e::Str: {
        // DIFFERENT local var
        value::Str* val = static_cast<value::Str*>(UP_val);
        print(StrFormat(str43, val->s));
      }
        break;
      case value_e::Int: {
        // ANOTHER DIFFERENT local var
        value::Int* val = static_cast<value::Int*>(UP_val);
        print(StrFormat(str44, val->i));
      }
        break;
      default:
        print(str45);
    }

This works because there's no problem having **different** variables with the
same name within each `case { }` block.

Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
the inner blocks will look like:

      case value_e::Str: {
        val = static_cast<value::Str*>(val);  // BAD: val reused
        print(StrFormat(str43, val->s));
      }

They will fail to compile, because `val` still has the superclass type
`value_t*`, which has no field `s`. Only the subclass `value::Str` has it.

(Note that Python has a single flat scope per function, while C++ has nested
scopes.)

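The flat-scope remark is easy to see in plain Python. In this small sketch (the function name `describe` is made up), a variable first assigned inside a branch is still visible after the branch ends:

```python
def describe(n):
    # type: (int) -> str
    if n > 0:
        label = 'positive'  # first assignment happens inside a branch
    else:
        label = 'non-positive'
    # Python: `label` is a function-level local, so it is visible here.
    # In C++, a variable declared inside either `{ }` block would already
    # be out of scope at this point.
    return label
```

This is why the translator collects locals and hoists their declarations to function level, and why a name that changes type between branches needs the `UP_*` idiom.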
### Python context manager &rarr; C++ constructor and destructor (RAII)

This Python code:

    with ctx_Foo(42):
      f()

translates to this C++ code:

    {
      ctx_Foo tmp(42);
      f();

      // destructor ~ctx_Foo implicitly called
    }

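On the Python side, a `ctx_*` class is an ordinary context manager. Here is a sketch under the assumption that setup lives in `__init__` (the future constructor) and cleanup lives in `__exit__` (the future destructor). `State` and `ctx_Indent` are hypothetical names, not classes from Oils.

```python
class State(object):
    """Hypothetical mutable state that the context manager modifies."""

    def __init__(self):
        # type: () -> None
        self.indent = 0


class ctx_Indent(object):
    """Hypothetical ctx_* class: raise the indent level for one block."""

    def __init__(self, state, amount):
        # type: (State, int) -> None
        # Setup goes here, so it maps onto the C++ constructor.
        self.state = state
        self.saved = state.indent
        state.indent += amount

    def __enter__(self):
        # type: () -> None
        pass

    def __exit__(self, exc_type, exc_value, traceback):
        # type: (object, object, object) -> None
        # Cleanup goes here, so it maps onto the C++ destructor.
        self.state.indent = self.saved


def demo():
    # type: () -> list
    state = State()
    seen = []
    with ctx_Indent(state, 2):
        seen.append(state.indent)  # inside the block
    seen.append(state.indent)  # restored after the block
    return seen
```

`demo()` records the indent inside and after the block, which shows the save / restore pairing that RAII gives in C++.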
## MyPy "Shimming" Technique

We have an interesting way of "writing Python and C++ at the same time":

1. First, all Python code must pass the MyPy type checker, and run with a stock
   Python 2 interpreter.
   - This is the source of truth &mdash; the source of our semantics.
1. We translate most `.py` files to C++, **except** some files, in particular
   [mycpp/mylib.py]($oils-src) and files starting with `py` like
   `core/{pyos,pyutil}.py`.
1. In C++, we can substitute custom implementations with the properties we
   want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
   `BufWriter` being efficient, etc.

The MyPy type system is very powerful! It lets us do all this.

### NewDict() for ordered dicts

Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
using `mylib.NewDict()`, which returns `collections_.OrderedDict`.

The **static type** is still `Dict[K, V]`, but we change the "spec" to be an
ordered dict.

In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
implement preserving order on deletion, which seems OK.)

- TODO: `iteritems()` could go away

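A sketch of the Python side of this shim. It uses the standard library `collections.OrderedDict` as a stand-in for `collections_.OrderedDict`, which is an assumption. The `demo` function is made up for illustration.

```python
import collections


def NewDict():
    # Sketch only: the standard library class stands in for the
    # collections_.OrderedDict named in the text.
    return collections.OrderedDict()


def demo():
    d = NewDict()
    d['zebra'] = 1
    d['apple'] = 2
    d['mango'] = 3
    return list(d.keys())  # insertion order, on Python 2 as well
```

Callers keep the static type `Dict[K, V]`. Only the runtime object changes, so iteration order in Python matches what the ordered C++ `Dict<K, V>` produces.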
### StackArray[T]

TODO: describe this when it works.

### BigInt

- In Python, it's simply defined as a class with an integer, in
  [mycpp/mops.py]($oils-src).
- In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
  integer.

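A minimal sketch of what "a class with an integer" can look like. The helpers `Add` and `ToStr` are hypothetical and only illustrate the style. They are not claimed to match the real module.

```python
class BigInt(object):
    """Sketch: a distinct static type that wraps a Python integer."""

    def __init__(self, i):
        # type: (int) -> None
        self.i = i


def Add(a, b):
    # type: (BigInt, BigInt) -> BigInt
    # Hypothetical helper: arithmetic goes through functions, so MyPy can
    # flag code that mixes BigInt with a plain int by accident.
    return BigInt(a.i + b.i)


def ToStr(b):
    # type: (BigInt) -> str
    # Hypothetical helper for display.
    return str(b.i)
```

Wrapping the integer makes the static type distinct from `int`, which lets the C++ side substitute a different representation without changing callers.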
### ByteAt(), ByteEquals(), ...

Hand optimizations that avoid creating 1-byte strings. They are used in the IFS
algorithm, `LooksLikeGlob()`, and `GlobUnescape()`.

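Here is a sketch of how such helpers could look on the Python side. The signatures are assumptions, and `CountStars` is a made-up caller. In Python, `s[i]` still creates a 1-character string, so any saving would come from the C++ versions, which can work on the raw byte.

```python
def ByteAt(s, i):
    # type: (str, int) -> int
    """Return the byte at index i as an integer."""
    return ord(s[i])


def ByteEquals(byte, ch):
    # type: (int, str) -> bool
    """Compare an integer byte against a 1-character string constant."""
    return byte == ord(ch)


def CountStars(s):
    # type: (str) -> int
    # Hypothetical caller: the loop body only handles integers.
    n = 0
    for i in range(len(s)):
        if ByteEquals(ByteAt(s, i), '*'):
            n += 1
    return n
```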
### File / LineReader / BufWriter

TODO: describe how this works.

Can it be more type safe? I think we can cast `File` to both `LineReader` and
`BufWriter`.

Or can we invert the relationship, so `File` derives from **both** `LineReader`
and `BufWriter`?

### Fast JSON - avoid intermediate allocations

- `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
  only to throw them away and write to `mylib.BufWriter`. Instead, we append
  the encoded string **directly** to the `BufWriter`.
- Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
  when writing indents.
  - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
- We may also want `BufWriter::write_slice()`

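The call-site shape of `write_spaces` can be sketched in Python. This `BufWriter` is a simplified stand-in, not the real `mylib.BufWriter`, and `PrintIndented` is a hypothetical caller.

```python
class BufWriter(object):
    """Sketch of a string builder; not the real mylib.BufWriter."""

    def __init__(self):
        # type: () -> None
        self.parts = []  # type: list

    def write(self, s):
        # type: (str) -> None
        self.parts.append(s)

    def write_spaces(self, n):
        # type: (int) -> None
        # This Python stand-in still builds ' ' * n. A C++ version can
        # instead fill its buffer in place, with no temporary string.
        if n > 0:
            self.parts.append(' ' * n)

    def getvalue(self):
        # type: () -> str
        return ''.join(self.parts)


def PrintIndented(lines, indent):
    # type: (list, int) -> str
    buf = BufWriter()
    for line in lines:
        buf.write_spaces(indent)  # the caller never creates the indent
        buf.write(line)
        buf.write('\n')
    return buf.getvalue()
```

The design point is that the caller passes a count rather than a string, which leaves the C++ implementation free to avoid the allocation.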
## Limitations Requiring Source Rewrites

mycpp itself may cause limitations on expressiveness, or the C++ language may
not be able to express what we want.

- C++ doesn't have `try / except / else`, or `finally`
  - Use the `with ctx_Foo` pattern instead.
- `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
  non-empty test
- Functions can have at most one keyword / optional argument.
  - We generate two methods: `f(x)` which calls `f(x, y)` with the default
    value of `y`
  - If there are two or more optional arguments:
    - For classes, you can use the "builder pattern", i.e. add an
      `Init_MyMember()` method
    - If the arguments are booleans, translate it to a single bitfield argument
- C++ has nested scope and Python has flat function scope. This can cause name
  collisions.
  - Could enforce this if it becomes a problem

Also see `mycpp/examples/invalid_*` for Python code that fails to translate.

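As an illustration of the boolean-arguments rewrite in the list above, two optional booleans can be folded into one optional integer. The flag names and the `Run` function are hypothetical.

```python
# Hypothetical flag values; each one occupies its own bit.
FLAG_VERBOSE = 1 << 0
FLAG_DRY_RUN = 1 << 1


def Run(cmd, flags=0):
    # type: (str, int) -> str
    # One optional argument instead of verbose=False, dry_run=False.
    parts = []
    if flags & FLAG_DRY_RUN:
        parts.append('would run')
    else:
        parts.append('run')
    parts.append(cmd)
    if flags & FLAG_VERBOSE:
        parts.append('(verbose)')
    return ' '.join(parts)
```

A call like `Run('ls', FLAG_VERBOSE | FLAG_DRY_RUN)` stays within the one-optional-argument limit, so the translator only needs to generate `Run(cmd)` and `Run(cmd, flags)`.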
## WARNING: Assumptions Not Checked

### Global Constants Can't Be Mutated

We translate top level constants to statically initialized C data structures
(zero startup cost):

    gStr = 'foo'
    gList = [1, 2]  # type: List[int]
    gDict = {'bar': 42}  # type: Dict[str, int]

Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
these global instances! The C++ code will break at runtime.

### Gotcha about Returning Variants (Subclasses) of a Type

MyPy will accept this code:

```
if cond:
  sig = proc_sig.Open  # type: proc_sig_t
  # bad because mycpp HOISTS this
else:
  sig = proc_sig.Closed.CreateNull()
  sig.words = words  # assignment fails
return sig
```

It will translate to C++, but fail to compile. The hoisted `sig` is declared
with the type `proc_sig_t`, so the assignment to `sig.words` is rejected.
Instead, rewrite it like this:

```
sig = None  # type: proc_sig_t
if cond:
  sig = proc_sig.Open
else:
  closed = proc_sig.Closed.CreateNull()
  closed.words = words  # OK: closed has the subclass type
  sig = closed
return sig
```

### Exceptions Can't Leave Destructors / Python `__exit__`

Context managers like `with ctx_Foo():` translate to C++ constructors and
destructors.

In C++, an exception can't "leave" a destructor. Since C++11, destructors are
`noexcept` by default, so an escaping exception calls `std::terminate()` at
runtime.

You can throw and CATCH an exception WITHIN a destructor, but you can't let it
propagate outside.

This means you must be careful when coding the `__exit__` method. For example,
in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
caught when restoring/popping redirects.

To fix the bug, we rewrote the code to use an out param
`List[IOError_OSError]`.

Related:

- <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>

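The out-param fix can be sketched as follows. This is not the actual `ctx_Redirect` code. `ctx_Restore`, `_Cleanup`, and `demo` are hypothetical names, and a plain list stands in for `List[IOError_OSError]`.

```python
class ctx_Restore(object):
    """Hypothetical context manager whose cleanup step can fail."""

    def __init__(self, should_fail, err_out):
        # type: (bool, list) -> None
        self.should_fail = should_fail
        self.err_out = err_out  # out param: errors are appended here

    def __enter__(self):
        # type: () -> None
        pass

    def _Cleanup(self):
        # type: () -> None
        if self.should_fail:
            raise IOError('restore failed')

    def __exit__(self, exc_type, exc_value, traceback):
        # type: (object, object, object) -> None
        try:
            self._Cleanup()
        except IOError as e:
            # Catch WITHIN __exit__, so nothing propagates out of what
            # becomes a C++ destructor.
            self.err_out.append(e)


def demo(should_fail):
    # type: (bool) -> int
    errors = []
    with ctx_Restore(should_fail, errors):
        pass
    return len(errors)  # the caller inspects the out param afterwards
```

The caller checks the list after the `with` block and reports the error from ordinary code, where throwing is safe.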
## More Translation Notes

### Hacky Heuristics

- `callable(arg)` to either:
  - function call `f(arg)`
  - instantiation `Alloc<T>(arg)`
- `name.attr` to either:
  - `obj->member`
  - `module::Func`
- `cast(MyType, obj)` to either:
  - `static_cast<MyType*>(obj)`
  - `reinterpret_cast<MyType*>(obj)`

### Hacky Hard-Coded Names

These are signs of coupling between mycpp and Oils, which ideally shouldn't
exist.

- `mycpp_main.py`
  - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
    runtime.
    - TODO: Pea can respect parameter order? So we do that outside the project?
  - Another ordering constraint comes from **inheritance**. The forward
    declaration is NOT sufficient in that case.
- `cppgen_pass.py`
  - `_GetCastKind()` has some hard-coded names
  - `AsdlType::Create()` is special cased to `::`, not `->`
  - Default arguments e.g. `scope_e::Local` need a repeated `using`.

Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>

### Major Features

- Python `int` and `bool` &rarr; C++ `int` and `bool`
  - `None` &rarr; `nullptr`
- Statically Typed Python Collections
  - `str` &rarr; `Str*`
  - `List[T]` &rarr; `List<T>*`
  - `Dict[K, V]` &rarr; `Dict<K, V>*`
  - tuples &rarr; `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
- Collection literals turn into initializer lists
  - And there is a C++ type inference issue which requires an explicit
    `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
- Python's polymorphic iteration &rarr; `StrIter`, `ListIter<T>`,
  `DictIter<K, V>`
  - `d.iteritems()` is rewritten as `mylib.iteritems()` &rarr; `DictIter`
    - TODO: can we be smarter about this?
  - `reversed(mylist)` &rarr; `ReverseListIter`
- Python's `in` operator:
  - `s in mystr` &rarr; `str_contains(mystr, s)`
  - `x in mylist` &rarr; `list_contains(mylist, x)`
- Classes and inheritance
  - `__init__` method becomes a constructor. Note: initializer lists aren't
    used.
  - Detect `virtual` methods
  - TODO: could we detect `abstract` methods? (`NotImplementedError`)
- Python generators `Iterator[T]` &rarr; eager `List<T>` accumulators
- Python Exceptions &rarr; C++ exceptions
- Python Modules &rarr; C++ namespace (we assume a 2-level hierarchy)
  - TODO: mycpp needs real modules, because our `oils_for_unix.mycpp.cc`
    translation unit is getting big.
  - And `cpp/preamble.h` is a hack to work around the lack of modules.

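The generator item in the list above changes evaluation order, which is worth seeing side by side. This is a conceptual sketch in Python. The names `Evens` and `EvensEager` are made up, and the exact shape of the generated C++ is not shown here.

```python
def Evens(limit):
    # Python semantics: a lazy generator.
    for i in range(limit):
        if i % 2 == 0:
            yield i


def EvensEager(limit):
    # Conceptually what an eager accumulator does: build the whole list
    # before the caller sees the first item.
    acc = []
    for i in range(limit):
        if i % 2 == 0:
            acc.append(i)  # each `yield` becomes an append
    return acc
```

Both produce the same items, but code that depends on laziness, such as an unbounded generator or a consumer that stops early, would behave differently after translation.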
### Minor Translations

- `s1 == s2` &rarr; `str_equals(s1, s2)`
- `'x' * 3` &rarr; `str_repeat(globalStr, 3)`
- `[None] * 3` &rarr; `list_repeat(nullptr, 3)`
- Omitted:
  - If the LHS of an assignment is `_`, then the statement is omitted
    - This is for `_ = log`, which shuts up Python lint warnings for 'unused
      import'
  - Code under `if __name__ == '__main__'`

### Optimizations

- Returning Tuples by value. To reduce GC pressure, we return
  `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and
  `Tuple4`.

### Rooting Policy

The translated code roots local variables in every function:

    StackRoots _r({&var1, &var2});

We have two kinds of hand-written code:

1. Methods like `Str::strip()` in `mycpp/`
2. OS bindings like `stat()` in `cpp/`

Neither of them needs any rooting! This is because we use **manual collection
points** in the interpreter, and these functions don't call any functions that
can collect. They are "leaves" in the call tree.

## The mycpp Runtime

The mycpp translator targets a runtime that's written from scratch. It
implements garbage-collected data structures like:

- Typed records
  - Python classes
  - ASDL product and sum types
- `Str` (immutable, as in Python)
- `List<T>`
- `Dict<K, V>`
- `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...

It also has functions based on CPython's:

- `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
  module, e.g. `int()` and `str()`
- `mycpp/gc_mylib.{h,cc}` corresponds to `mylib.py`
  - `mylib.BufWriter` is a bit like `cStringIO.StringIO`

### Differences from CPython

- Integers are either C `int` or `mylib.BigInt`, not Python's arbitrary size
  integers
- `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
  CPython
- `s.strip()` is defined in terms of ASCII whitespace, which does not include
  say `\v`.
  - This is done to be consistent with JSON and J8 Notation.

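The `strip()` difference can be reproduced in Python. This sketch assumes the whitespace set is the four JSON whitespace bytes, which the text suggests but does not spell out. `AsciiStrip` is a hypothetical stand-in, not the runtime's code.

```python
JSON_WHITESPACE = ' \t\r\n'  # assumption: the four JSON whitespace bytes


def AsciiStrip(s):
    # type: (str) -> str
    """Hypothetical stand-in for the runtime's strip(); not the real code."""
    return s.strip(JSON_WHITESPACE)


def CPythonStrip(s):
    # type: (str) -> str
    """CPython's default strip(), which also removes vertical tab."""
    return s.strip()
```

With the input `'\vfoo \n'`, `CPythonStrip` removes the leading `\v` while `AsciiStrip` keeps it, so code that relies on stripping vertical tabs needs care.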
## C++ Notes

### Gotchas

- C++ classes can have 2 member variables of the same name! From the base
  class and derived class.
- Failing to declare methods `virtual` can result in the wrong one being called
  at runtime

### Minor Features Used

In addition to classes, templates, exceptions, etc. mentioned above, we use:

- `static_cast` and `reinterpret_cast`
- `enum class` for ASDL
- Function overloading
  - For equality and hashing?
- `offsetof` for introspection of field positions for garbage collection
- `std::initializer_list` for `StackRoots()`
  - Should we get rid of this?

### Not Used

- I/O Streams, RTTI, etc.
- `const`
- Smart pointers
