Why Sponsor Oils? | source | all docs for version 0.23.0 | all versions | oilshell.org
This is a Python-to-C++ translator based on MyPy. It only handles the small subset of Python that we use in Oils.
It's inspired by both mypyc and Shed Skin. These posts give background:
As of March 2024, the translation to C++ is done. So it's no longer experimental!
However, it's still pretty hacky. This doc exists mainly to explain the hacks. (We may want to rewrite mycpp as "yaks", although it's low priority right now.)
Source for this doc: mycpp/README.md. The code is all in mycpp/.
oils-cppRunning mycpp is best done on a Debian / Ubuntu-ish machine.  Follow the
instructions at https://github.com/oilshell/oil/wiki/Contributing to create
the "dev build" first, which is DISTINCT from the C++ build.  Make sure you can
run:
oil$ build/py.sh all
This will give you a working shell:
oil$ bin/osh -c 'echo hi'  # running interpreted Python
hi
To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's dependencies. First install packages:
# We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
oil$ build/deps.sh install-ubuntu-packages
Then fetch data, like the Python 3.10 tarball and MyPy repo:
oil$ build/deps.sh fetch
Then build from source:
oil$ build/deps.sh install-wedges
To build oil-native, use:
oil$ ./NINJA-config.sh
oil$ ninja              # translate and compile, may take 30 seconds
oil$ _bin/cxx-asan/osh -c 'echo hi'  # running compiled C++ !
hi
To run the tests and benchmarks:
oil$ mycpp/TEST.sh test-translator
... 200+ tasks run ...
If you have problems, post a message on #oil-dev at
https://oilshell.zulipchat.com.  Not many people have contributed to mycpp,
so I can use your feedback!
Related:
There are four passes over the MyPy AST.
(1) const_pass.py: Collect string constants
Turn turn the constant in myfunc("foo") into top-level GLOBAL_STR(str1, "foo").
(2) Three passes in cppgen_pass.py.
(a) Forward Declaration Pass.
class Foo;
class Bar;
This pass also determines which methods should be declared virtual in their
declarations.  The virtual keyword is written in the next pass.
(b) Declaration Pass.
class Foo {
  void method();
};
class Bar {
  void method();
};
More work in this pass:
(c) Definition Pass.
void Foo:method() {
  ...
}
void Bar:method() {
  ...
}
Note: I really wish we were not using visitors, but that's inherited from MyPy.
Oils is written in typed Python 2. It will run under a stock Python 2 interpreter, and it will typecheck with stock MyPy.
However, there are a few language features that don't map cleanly from typed Python to C++:
static_cast and
reinterpret_cast.  (We don't use C-style casting.)So this describes the idioms we use.  There are some hacks in
mycpp/cppgen_pass.py to handle these cases, and also Python
runtime equivalents in mycpp/mylib.py.
with {,tag,str_}switch → Switch statementWe have three constructs that translate to a C++ switch statement.  They use a
Python context manager with Xswitch(obj) ... as a little hack.
Here are examples like the ones in mycpp/examples/test_switch.py.
(ninja mycpp-logs-equal translates, compiles, and tests all the examples.)
Simple switch:
myint = 99
with switch(myint) as case:
    if case(42, 43):
        print('forties')
    else:
        print('other')
Switch on object type, which goes well with ASDL sum types:
val = value.Str('foo)  # type: value_t
with tagswitch(val) as case:
    if case(value_e.Str, value_e.Int):
        print('string or int')
    else:
        print('other')
We usually need to apply the UP_val pattern here, described in the next
section.
Switch on string, which generates a fast two-level dispatch -- first on
length, and then with str_equals_c():
s = 'foo'
with str_switch(s) as case:
    if case("foo")
        print('FOO')
    else:
        print('other')
val → UP_val → val Downcasting patternSummary: variable names like UP_* are special in our Python code.
Consider the downcasts marked BAD:
val = value.Str('foo)  # type: value_t
with tagswitch(obj) as case:
    if case(value_e.Str):
        val = cast(value.Str, val)  # BAD: conflicts with first declaration
        print('s = %s' % val.s)
    elif case(value_e.Int):
        val = cast(value.Int, val)  # BAD: conflicts with both
        print('i = %d' % val.i)
    else:
        print('other')
MyPy allows this, but it translates to invalid C++ code.  C++ can't have a
variable named val, with 2 related types value_t and value::Str.
So we use this idiom instead, which takes advantage of local vars in case blocks in C++:
val = value.Str('foo')  # type: value_t
UP_val = val  # temporary variable that will be casted
with tagswitch(val) as case:
    if case(value_e.Str):
        val = cast(value.Str, UP_val)  # this works
        print('s = %s' % val.s)
    elif case(value_e.Int):
        val = cast(value.Int, UP_val)  # also works
        print('i = %d' % val.i)
    else:
        print('other')
This translates to something like:
value_t* val = Alloc<value::Str>(str42);
value_t* UP_val = val;
switch (val->tag()) {
    case value_e::Str: {
        // DIFFERENT local var
        value::Str* val = static_cast<value::Str>(UP_val);
        print(StrFormat(str43, val->s))
    }
        break;
    case value_e::Int: {
        // ANOTHER DIFFERENT local var
        value::Int* val = static_cast<value::Int>(UP_val);
        print(StrFormat(str44, val->i))
    }
        break;
    default:
        print(str45);
}
This works because there's no problem having different variables with the
same name within each case { } block.
Again, the names UP_* are special.  If the name doesn't start with UP_,
the inner blocks will look like:
    case value_e::Str: {
        val = static_cast<value::Str>(val);  // BAD: val reused
        print(StrFormat(str43, val->s))
    }
And they will fail to compile.  It's not valid C++ because the superclass
value_t doesn't have a field val->s.  Only the subclass value::Str has
it.
(Note that Python has a single flat scope per function, while C++ has nested scopes.)
This Python code:
with ctx_Foo(42):
  f()
translates to this C++ code:
{
  ctx_Foo tmp(42);
  f()
  // destructor ~ctx_Foo implicitly called
}
We have an interesting way of "writing Python and C++ at the same time":
.py files to C++, except some files, in particular
mycpp/mylib.py and files starting with py like
core/{pyos.pyutil}.py.Dict<K, V> being ordered, BigInt being distinct from C int,
BufWriter being efficient, etc.The MyPy type system is very powerful! It lets us do all this.
Dicts in Python 2 aren't ordered, but we make them ordered at runtime by
using mylib.NewDict(), which returns collections_.OrderedDict.
The static type is still Dict[K, V], but change the "spec" to be an
ordered dict.
In C++, Dict<K, V> is implemented as an ordered dict.  (Note: we don't
implement preserving order on deletion, which seems OK.)
iteritems() could go awayTODO: describe this when it works.
typedef int64_t BigInt, but we want to make it a big
integer.Hand optimization to reduce 1-byte strings.  For IFS algorithm,
LooksLikeGlob(), GlobUnescape().
TODO: describe how this works.
Can it be more type safe?  I think we can cast File to both LineReader and
BufWriter.
Or can we invert the relationship, so File derives from both LineReader
and BufWriter?
pyj8.WriteString() is shimmed so we don't create encoded J8 string objects,
only to throw them away and write to mylib.BufWriter.  Instead, we append
an encoded strings directly to the BufWriter.BufWriter::write_spaces to avoid temporary allocations
when writing indents.
BufWriter::write_repeated(' ', 42).BufWriter::write_slice()mycpp itself may cause limitations on expressiveness, or the C++ language may be able express what we want.
try / except / else, or finally
with ctx_Foo pattern instead.if mylist tests if the pointer is non-NULL; use if len(mylist) for
non-empty testf(x) which calls f(x, y) with the default
value of yInit_MyMember() methodAlso see mycpp/examples/invalid_* for Python code that fails to translate.
We translate top level constants to statically initialized C data structures (zero startup cost):
gStr = 'foo'   
gList = [1, 2]  # type: List[int]
gDict = {'bar': 42}  # type: Dict[str, int]
Even though List and Dict are mutable in general, you should NOT mutate
these global instances!  The C++ code will break at runtime.
MyPy will accept this code:
if cond:
  sig = proc_sig.Open  # type: proc_sig_t
                       # bad because mycpp HOISTS this
else:
  sig = proc_sig.Closed.CreateNull()
  sig.words = words    # assignment fails
return sig
It will translate to C++, but fail to compile. Instead, rewrite it like this:
sig = None  # type: proc_sig_t
if cond:
  sig = proc_sig.Open  # type: proc_sig_t
                       # bad because mycpp HOISTS this
else:
  closed = proc_sig.Closed.CreateNull()
  closed.words = words    # assignment fails
  sig = closed
return sig
__exit__Context managers like with ctx_Foo(): translate to C++ constructors and
destructors.
In C++, a destructor can't "leave" an exception. It results in a runtime error.
You can throw and CATCH an exception WITHIN a destructor, but you can't let it propagate outside.
This means you must be careful when coding the __exit__ method.  For example,
in vm::ctx_Redirect, we had this bug due to IOError being thrown and not
caught when restoring/popping redirects.
To fix the bug, we rewrote the code to use an out param
List[IOError_OSError].
Related:
callable(arg) to either:
f(arg)Alloc<T>(arg)name.attr to either:
obj->membermodule::Funccast(MyType, obj) to either
static_cast<MyType*>(obj)reinterpret_cast<MyType*>(obj)These are signs of coupling between mycpp and Oils, which ideally shouldn't exist.
mycpp_main.py
ModulesToCompile() -- some files have to be ordered first, like the ASDL
runtime.
cppgen_pass.py
_GetCastKind() has some hard-coded namesAsdlType::Create() is special cased to ::, not ->scope_e::Local need a repeated using.Issue on mycpp improvements: https://github.com/oilshell/oil/issues/568
int and bool → C++ int and bool
None → nullptrstr → Str*List[T] → List<T>*Dict[K, V] → Dict<K, V>*Tuple2<A, B>, Tuple3<A, B, C>, etc.std::initializer_list<int>{1, 2, 3}, not just {1, 2, 3}StrIter, ListIter<T>, DictIter<K, V
d.iteritems() is rewritten mylib.iteritems() → DictIter
reversed(mylist) → ReverseListIterin operator:
s in mystr → str_contains(mystr, s)x in mylist → list_contains(mylist, x)__init__ method becomes a constructor.  Note: initializer lists aren't
used.virtual methodsabstract methods? (NotImplementedError)Iterator[T] → eager List<T> accumulatorsoils_for_unix.mycpp.cc
translation unit is getting big.cpp/preamble.h is a hack to work around the lack of modules.s1 == s2 → str_equals(s1, s2)'x' * 3 → str_repeat(globalStr, 3)[None] * 3 → list_repeat(nullptr, 3)_, then the statement is omitted
_ = log, which shuts up Python lint warnings for 'unused
import'if __name__ == '__main__'Tuple2<A, B> instead of Tuple2<A, B>*, and likewise for Tuple3 and Tuple4.The translated code roots local variables in every function
StackRoots _r({&var1, &var2});
We have two kinds of hand-written code:
Str::strip() in mycpp/stat() in cpp/Neither of them needs any rooting! This is because we use manual collection points in the interpreter, and these functions don't call any functions that can collect. They are "leaves" in the call tree.
The mycpp translator targets a runtime that's written from scratch. It implements garbage-collected data structures like:
Str (immutable, as in Python)List<T>Dict<K, V>Tuple2<A, B>, Tuple3<A, B, C>, ...It also has functions based on CPython's:
mycpp/gc_builtins.{h,cc} corresponds roughly to Python's __builtin__
module, e.g. int() and str()mycpp/gc_mylib.{h,cc} corresponds mylib.py
mylib.BufWriter is a bit like cStringIO.StringIOint or mylib.BigInt, not Python's arbitrary size
integersNUL bytes are allowed in arguments to syscalls like open(), unlike in
CPythons.strip() is defined in terms of ASCII whitespace, which does not include
say \v.
virtual can involve the wrong one being called
at runtimeIn addition to classes, templates, exceptions, etc. mentioned above, we use:
static_cast and reinterpret_castenum class for ASDLoffsetof for introspection of field positions for garbage collectionstd::initializer_list for StackRoots()
const