Egg Expressions (YSH Regexes)

YSH has a new syntax for patterns, which appears between the / / delimiters:

if (mystr ~ /d+ '.' d+/) {   
  echo 'mystr looks like a number N.M'
}

These patterns are intended to be familiar, but they differ from POSIX or Perl expressions in important ways. So we call them eggexes rather than regexes!

Table of Contents
Why Invent a New Language?
Example of Pattern Reuse
Design Philosophy
The Expression Language Is Consistent
Expression Primitives
. Is Now dot
Classes Are Unadorned: word, w, alnum
Zero-width Assertions Look Like %this
Single-Quoted Strings
Compound Expressions
Sequence and Alternation Are Unchanged
Repetition Is Unchanged In Common Cases, and Better in Rare Cases
Negation Consistently Uses !
Splice Other Patterns @var_name or UpperCaseVarName
Group With ()
Capture with <capture ...>
Character Class Literals Use []
Backtracking Constructs Use !! (Discouraged)
Outside the Expression language
Flags and Translation Preferences (;)
Multiline Syntax
The YSH API
Language Reference
Usage Notes
Use character literals rather than C-Escaped strings
POSIX ERE Limitations
Repetition of Strings Requires Grouping
Unicode char literals are limited in range
Don't put non-ASCII bytes in string sets in char classes
Char class literals: ^ - ] \
Critiques
Regexes Are Hard To Read
YSH is Shorter Than Bash
... and Perl
Design Notes
Eggexes In Other Languages
Backward Compatibility
FAQ
The Name Sounds Funny.
How Do Eggexes Compare with Raku Regexes and the Rosie Pattern Language?
What About Eggex versus Parsing Expression Grammars? (PEGs)
Why Don't dot, %start, and %end Have More Precise Names?
Where Do I Send Feedback?

Why Invent a New Language?

Example of Pattern Reuse

Here's a longer example:

# Define a subpattern.  'digit' and 'd' are the same.
$ var D = / digit{1,3} /

# Use the subpattern
$ var ip_pat = / D '.' D '.' D '.' D /

# This eggex compiles to an ERE
$ echo $ip_pat
[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}

This means you can use it in a very simple way:

$ egrep $ip_pat foo.txt
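
The composition step can be approximated outside YSH. The sketch below is in Python rather than YSH, and it is not how Oils implements splicing: the names D and ip_pat simply mirror the example, and [0-9] stands in for [[:digit:]] because Python's re module does not support POSIX bracket classes.

```python
import re

# Stand-in for the subpattern D = / digit{1,3} /.
# Assumption: [0-9] replaces [[:digit:]], which Python's re does not support.
D = r"[0-9]{1,3}"

# Splice D four times, separated by an escaped literal dot.
ip_pat = re.escape(".").join([D] * 4)

print(ip_pat)
print(bool(re.search(ip_pat, "host 10.0.0.1 is up")))
```

Because the result is an ordinary string, it can be handed to any tool that accepts a regex, which is the point of the egrep example.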

TODO: You should also be able to inline patterns like this:

egrep $/d+/ foo.txt

Design Philosophy

The Expression Language Is Consistent

Eggexes have a consistent syntax:

For example, it's easy to see that these patterns all match three characters:

/ d d d /
/ digit digit digit /
/ dot dot dot /
/ word space word /
/ 'ab' space /
/ 'abc' /

And that these patterns match two:

/ %start w w /
/ %start 'if' /
/ d d %end /

And that you have to look up the definition of HexDigit to know how many characters this matches:

/ %start HexDigit %end /

Constructs like . ^ $ \< \> are deprecated because they break these rules.
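
To make the width rule concrete, here is a small Python check. The regex strings are hand-written Perl-style equivalents of the eggexes above (assumptions, not output from Oils), and the sample inputs are made up. Each pattern is matched against its sample and the length of the match is recorded.

```python
import re

# (eggex, hand-written Perl-style equivalent, sample input)
cases = [
    ("/ d d d /", r"\d\d\d", "123"),
    ("/ dot dot dot /", r"...", "a-b"),
    ("/ word space word /", r"\w\s\w", "a b"),
    ("/ 'ab' space /", r"ab\s", "ab "),
    ("/ %start w w /", r"^\w\w", "hi"),
    ("/ d d %end /", r"\d\d$", "42"),
]

widths = {}
for eggex, regex, sample in cases:
    m = re.search(regex, sample)
    widths[eggex] = len(m.group(0))
    print(eggex, "->", widths[eggex])
```

The assertions %start and %end contribute nothing to the width, which is why the last two patterns match two characters rather than three.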

Expression Primitives

. Is Now dot

But . is still accepted. It usually matches any character except a newline, although this changes based on flags (e.g. dotall, unicode).

Classes Are Unadorned: word, w, alnum

We accept both Perl and POSIX classes.

Zero-width Assertions Look Like %this

Single-Quoted Strings

Note: instead of using double-quoted strings like "xyz $var", you can splice a string into an eggex:

/ 'xyz ' @var /

Compound Expressions

Sequence and Alternation Are Unchanged

You can also write a more Pythonic alternative: x or y.

Repetition Is Unchanged In Common Cases, and Better in Rare Cases

Repetition is just like POSIX ERE or Perl:

We've reserved syntactic space for PCRE and Python variants:

Negation Consistently Uses !

You can negate named char classes:

/ !digit /

and char class literals:

/ ![ a-z A-Z ] /

Sometimes you can do both:

/ ![ !digit ] /  # translates to /[^\D]/ in PCRE
                 # error in ERE because it can't be expressed

You can also negate "regex modifiers" / compilation flags:

/ word ; ignorecase /   # flag on
/ word ; !ignorecase /  # flag off
/ word ; !i /           # abbreviated

In contrast, regexes have many confusing syntaxes for negation:

[^abc] vs. [abc]
[[^:digit:]] vs. [[:digit:]]

\D vs. \d

/\w/-i vs /\w/i
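
The double negation shown earlier can be checked against a Perl-style engine. This Python sketch assumes the translation [^\D] quoted in the comment and confirms that it accepts the same characters as \d on an ASCII sample; the sample string is made up for illustration.

```python
import re

sample = "a1 b22_c\n3"

# / !digit / would correspond to \D, and / ![ !digit ] / to [^\D].
not_digit = re.findall(r"\D", sample)
double_negated = re.findall(r"[^\D]", sample)
plain_digit = re.findall(r"\d", sample)

print(double_negated)
```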

Splice Other Patterns @var_name or UpperCaseVarName

This allows you to reuse patterns. Using uppercase variables:

var D = / digit{1,3} /

var ip_addr = / D '.' D '.' D '.' D /

Using normal variables:

var part = / digit{1,3} /

var ip_addr = / @part '.' @part '.' @part '.' @part /

This is similar to how lex and re2c work.

Group With ()

Parentheses are used for precedence:

('foo' | 'bar')+

See note below: When translating to POSIX ERE, grouping becomes a capturing group. POSIX ERE has no non-capturing groups.
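
The difference between grouping and capturing is visible in engines that support both. The Python sketch below is only an analogy, since ERE itself has no (?:...) form: it shows that a plain parenthesized group adds a capture, while a non-capturing group does not.

```python
import re

# In ERE, ('foo' | 'bar')+ can only be written with a capturing group.
capturing = re.compile(r"(foo|bar)+")

# Perl-style engines also offer a non-capturing form.
non_capturing = re.compile(r"(?:foo|bar)+")

# Number of capture groups in each compiled pattern.
print(capturing.groups, non_capturing.groups)

# A repeated capturing group keeps only its last repetition.
print(capturing.fullmatch("foobarfoo").group(1))
```

In practice this means that adding parentheses for precedence can shift the numbering of positional captures when the target is ERE.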

Capture with <capture ...>

Here's a positional capture:

<capture d+>           # Becomes _group(1)

Add a variable after as for named capture:

<capture d+ as month>  # Becomes _group('month')

You can also add type conversion functions:

<capture d+ : int>           # _group(1) returns an Int, not Str
<capture d+ as month: int>   # _group('month') returns an Int, not Str
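
For comparison, here is roughly what the same idea looks like by hand in Python. This is an analogy and not the YSH API: the helper group_as and the sample string are hypothetical, and (?P<month>...) is Python's spelling of a named group.

```python
import re

pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

def group_as(match, name, convert=str):
    """Return a captured group, passed through a conversion function."""
    return convert(match.group(name))

m = pat.search("due 06-30")
month = group_as(m, "month", int)   # named group, converted to int
day = group_as(m, 2)                # positional group, left as a string
print(month, day)
```

Attaching the conversion to the capture, as eggex does, keeps the type next to the pattern instead of scattering int() calls through the calling code.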

Character Class Literals Use []

Example:

[ a-f 'A'-'F' \xFF \u{03bc} \n \\ \' \" \0 ]

Terms:

Only letters, numbers, and the underscore may be unquoted:

/['a'-'f' 'A'-'F' '0'-'9']/
/[a-f A-F 0-9]/              # Equivalent to the above

/['!' - ')']/                # Correct range
/[!-)]/                      # Syntax Error

Ranges must be separated by spaces:

No:

/[a-fA-F0-9]/

Yes:

/[a-f A-F 0-9]/

Backtracking Constructs Use !! (Discouraged)

If you want to translate to PCRE, you can use these.

!!REF 1
!!REF name

!!AHEAD( d+ )
!!NOT_AHEAD( d+ )
!!BEHIND( d+ )
!!NOT_BEHIND( d+ )

!!ATOMIC( d+ )

Since they all begin with !!, you can visually audit your code for potential performance problems.
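
As a rough guide to what these constructs mean, the Python sketch below uses the Perl-style forms they presumably correspond to. The mapping in the comments is an assumption based on the construct names, not taken from the Oils translator. The lookbehind examples use a single character because Python requires fixed-width lookbehind, and !!ATOMIC is omitted because atomic groups need Python 3.11 or later.

```python
import re

# Assumed correspondence (not verified against Oils):
#   !!REF 1         ~  \1
#   !!AHEAD(...)    ~  (?=...)     !!NOT_AHEAD(...)   ~  (?!...)
#   !!BEHIND(...)   ~  (?<=...)    !!NOT_BEHIND(...)  ~  (?<!...)

doubled = re.search(r"(\w)\1", "hello").group(0)              # backreference
ahead = re.search(r"[a-z]+(?=\d)", "abc123").group(0)         # lookahead
not_ahead = re.search(r"foo(?!bar)", "foobar foobaz").start() # negative lookahead
behind = re.search(r"(?<=-)\d+", "id-42").group(0)            # lookbehind
not_behind = re.search(r"(?<!-)\b\d+", "id-42 7").group(0)    # negative lookbehind

print(doubled, ahead, not_ahead, behind, not_behind)
```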

Outside the Expression language

Flags and Translation Preferences (;)

Flags or "regex modifiers" appear after a semicolon:

/ digit+ ; i /  # ignore case

A translation preference is specified after a second semicolon:

/ digit+ ; ; ERE /                # translates to [[:digit:]]+
/ digit+ ; ; python /             # could translate to \d+

Flags and translation preferences together:

/ digit+ ; ignorecase ; python /  # could translate to (?i)\d+

In Oils, the following flags are currently supported:

reg_icase / i (Ignore Case)

Use this flag to ignore case when matching. For example, /'foo'; i/ matches 'FOO', but /'foo'/ doesn't.
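
The comment above suggests (?i)\d+ as a possible Python translation of a flagged pattern. The sketch below applies the same inline-flag idea to the 'foo' example. The (?i) prefix is Python syntax, and whether Oils emits exactly this form is an assumption.

```python
import re

# /'foo'/ without the flag: case must match exactly.
without_flag = re.search(r"foo", "FOO")

# /'foo' ; i/ with the flag, written as a Python inline flag.
with_flag = re.search(r"(?i)foo", "FOO")

print(without_flag, with_flag.group(0))
```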

reg_newline (Multiline)

With this flag, %end will match before a newline and %start will match after a newline.

= u'abc123\n' ~ / digit %end ; reg_newline /    # true
= u'abc\n123' ~ / %start digit ; reg_newline /  # true

Without the flag, %start and %end only match at the start or end of the string, respectively.

= u'abc123\n' ~ / digit %end /                  # false
= u'abc\n123' ~ / %start digit /                # false

With this flag, dot and negated patterns like ![abc] or !digit also stop matching newlines.

= u'\n' ~ / . ; reg_newline /                   # false
= u'\n' ~ / !digit ; reg_newline /              # false

Without this flag, the newline \n is treated as an ordinary character.

= u'\n' ~ / . /                                 # true
= u'\n' ~ / !digit /                            # true
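
Other engines split these behaviors across separate flags, which is worth keeping in mind when a pattern is translated. As a point of comparison only (this is Python's re module, not the POSIX engine that YSH uses), MULTILINE controls whether ^ matches after a newline, and DOTALL controls whether dot matches a newline.

```python
import re

# Comparable to %start without and with reg_newline.
start_plain = re.search(r"^\d", "abc\n123")
start_multiline = re.search(r"^\d", "abc\n123", re.MULTILINE)

# Python's dot excludes the newline by default; DOTALL includes it.
dot_plain = re.search(r".", "\n")
dot_all = re.search(r".", "\n", re.DOTALL)

print(start_plain, bool(start_multiline), dot_plain, bool(dot_all))
```

Note that the defaults differ: Python's dot skips newlines unless DOTALL is set, whereas the POSIX behavior described here matches them unless reg_newline is set.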

Multiline Syntax

You can spread regexes over multiple lines and add comments:

var x = ///
  digit{4}   # year e.g. 2001
  '-'
  digit{2}   # month e.g. 06
  '-'
  digit{2}   # day e.g. 31
///

(Not yet implemented in YSH.)
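
The closest widely available analogue is the verbose mode of Perl-style engines. The Python sketch below is a hand translation of the date pattern above using re.VERBOSE; it illustrates the idea and is not generated by Oils.

```python
import re

# In verbose mode, whitespace is ignored and # starts a comment.
date_pat = re.compile(r"""
    \d{4}   # year e.g. 2001
    -
    \d{2}   # month e.g. 06
    -
    \d{2}   # day e.g. 31
""", re.VERBOSE)

print(bool(date_pat.fullmatch("2001-06-31")))
```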

The YSH API

See the YSH regex API for details.

In summary, YSH has Perl-like conveniences with a ~ operator:

var s = 'on 04-01, 10-31'
var pat = /<capture d+ as month> '-' <capture d+ as day>/

if (s ~ pat) {       # search for the pattern
  echo $[_group('month')]  # => 04
}

It also has an explicit and powerful Python-like API with the search() and leftMatch() methods on strings.

var m = s => search(pat, pos=8)  # start searching at a position
if (m) {
  echo $[m => group('month')]  # => 10
}
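
Readers who know Python may find the correspondence easier to see side by side. In this sketch, Pattern.search(string, pos) plays the role of search(pat, pos=8), and Pattern.match(string, pos) is the nearest Python analogue of leftMatch(), assuming leftMatch() anchors the match at the given position. The regex is a hand translation of pat.

```python
import re

s = "on 04-01, 10-31"
pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

first = pat.search(s)        # search from the start of the string
later = pat.search(s, 8)     # start searching at position 8
anchored = pat.match(s, 3)   # must match exactly at position 3
unanchored = pat.match(s)    # position 0 is 'o', so this fails

print(first.group("month"), later.group("month"), anchored.group("day"))
```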

Language Reference

Usage Notes

Use character literals rather than C-Escaped strings

No:

/ $'foo\tbar' /   # Match 7 characters including a tab, but it's hard to read
/ r'foo\tbar' /   # The string must contain 8 chars including '\' and 't'

Yes:

# Instead, take advantage of char literals and implicit regex concatenation
/ 'foo' \t 'bar' /
/ 'foo' \\ 'tbar' /
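
The character counts in the comments above are easy to confirm. Python has the same distinction between C-escaped and raw strings, so this sketch uses it as a stand-in for $'...' and r'...'.

```python
c_escaped = "foo\tbar"   # like $'foo\tbar': the \t is one tab character
raw = r"foo\tbar"        # like r'foo\tbar': a backslash followed by 't'

print(len(c_escaped), len(raw))
```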

POSIX ERE Limitations

Repetition of Strings Requires Grouping

Repetitions like * + ? apply only to the last character, so literal strings need extra grouping:

No:

'foo'+ 

Yes:

<capture 'foo'>+

Also OK:

('foo')+  # this is a CAPTURING group in ERE

This is necessary because ERE doesn't have non-capturing groups like Perl's (?:...), and Eggex only does "dumb" translations. It doesn't silently insert constructs that change the meaning of the pattern.
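
The effect of the missing group is easy to reproduce in any engine with ERE-like repetition. In this Python sketch, foo+ is the naive translation of 'foo'+ and (foo)+ is the grouped one.

```python
import re

ungrouped = re.compile(r"foo+")    # + applies only to the final 'o'
grouped = re.compile(r"(foo)+")    # + applies to the whole string 'foo'

print(bool(ungrouped.fullmatch("fooo")), ungrouped.fullmatch("foofoo"))
print(bool(grouped.fullmatch("foofoo")))
```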

Unicode char literals are limited in range

ERE can't represent this set of 1 character reliably:

/ [ \u{0100} ] /      # This char is 2 bytes encoded in UTF-8

These sets are accepted:

/ [ \u{1} \u{2} ] /   # set of 2 chars
/ [ \x01 \x02 ] /     # set of 2 bytes

They happen to be identical when translated to ERE, but may not be when translated to PCRE.
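
The byte counts behind this limitation can be checked directly. The Python sketch below encodes the code points from the examples as UTF-8; it says nothing about how a particular ERE engine handles them.

```python
wide = "\u0100".encode("utf-8")     # U+0100 needs two bytes in UTF-8
narrow = "\u0001".encode("utf-8")   # U+0001 fits in one byte

print(len(wide), len(narrow), narrow == b"\x01")
```

For code points below U+0080 the character and the byte coincide, which is why \u{1} and \x01 produce the same ERE.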

Don't put non-ASCII bytes in string sets in char classes

This is a sequence of characters:

/ $'\xfe\xff' /

This is a set of characters that is illegal:

/ [ $'\xfe\xff' ] /  # set or sequence?  It's confusing

This is a better way to write it:

/ [ \xfe \xff ] /  # set of 2 chars

Char class literals: ^ - ] \

The literal characters ^ - ] \ are problematic because they can be confused with operators.

The Eggex-to-ERE translator is smart enough to handle cases like this:

var pat = / ['^' 'x'] / 
# translated to [x^], not [^x] for correctness

However, cases like these are fatal runtime errors:

var pat1 = / ['a'-'^'] /
var pat2 = / ['a'-'-'] /
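
The reordering to [x^] matters because position changes the meaning of ^ inside a bracket expression. Python's re module follows the same convention, so it can serve as a quick check of the two forms from the comment above:

```python
import re

literal_caret = re.compile(r"[x^]")   # set containing 'x' and '^'
negated = re.compile(r"[^x]")         # any character except 'x'

print(bool(literal_caret.fullmatch("^")), negated.fullmatch("x"))
```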

Critiques

Regexes Are Hard To Read

... because the same symbol can mean many things.

^ could mean:

\ is used in:

? could mean:

With egg expressions, each construct has a distinct syntax.

YSH is Shorter Than Bash

Bash:

if [[ $x =~ [[:digit:]]+ ]]; then
  echo 'x looks like a number'
fi

Compare with YSH:

if (x ~ /digit+/) {
  echo 'x looks like a number'
}

... and Perl

Perl:

$x =~ /\d+/

YSH:

x ~ /d+/

The Perl expression has three more punctuation characters: $, =, and \.

Design Notes

Eggexes In Other Languages

The eggex syntax can be incorporated into other tools and shells. It's designed to be separate from YSH -- hence the separate name.

Notes:

Backward Compatibility

Eggexes aren't backward compatible in general, but they retain some legacy operators like ^ . $ to ease the transition. These expressions are valid eggexes and valid POSIX EREs:

.*
^[0-9]+$
^.{1,3}|[0-9][0-9]?$

FAQ

The Name Sounds Funny.

If "eggex" sounds too much like "regex" to you, simply say "egg expression". It won't be confused with "regular expression" or "regex".

How Do Eggexes Compare with Raku Regexes and the Rosie Pattern Language?

All three languages support pattern composition and have quoted literals. And they have the goal of improving upon Perl 5 regex syntax, which has made its way into every major programming language (Python, Java, C++, etc.).

The main difference is that Eggexes are meant to be used with existing regex engines. For example, you translate them to a POSIX ERE, which is executed by egrep or awk. Or you translate them to a Perl-like syntax and use them in Python, JavaScript, Java, or C++ programs.

Raku (formerly Perl 6) and Rosie have their own engines that are more powerful than PCRE, Python, etc. That means they cannot be used this way.

What About Eggex versus Parsing Expression Grammars? (PEGs)

The short answer is that they can be complementary: PEGs are closer to parsing, while eggex and regular languages are closer to lexing. Related:

The PEG model is more resource intensive, but it can recognize more languages, and it can recognize recursive structure (trees).

Why Don't dot, %start, and %end Have More Precise Names?

Because the meanings of . ^ and $ are usually affected by regex engine flags, like dotall, multiline, and unicode.

As a result, the names mean nothing more than "however your regex engine interprets . ^ and $".

As mentioned in the "Philosophy" section above, eggex only does a superficial, one-to-one translation. It doesn't understand the details of which characters will be matched under which engine.

Where Do I Send Feedback?

Eggexes are implemented in YSH, but not yet set in stone.

Please try them, as described in this post and the README, and send us feedback!

You can create a new post on /r/oilshell or a new message on #oil-discuss on https://oilshell.zulipchat.com/ (log in with GitHub, etc.)

Generated on Sun, 28 Jul 2024 06:21:02 +0000