2023-07-09

Debugging Perl 6 books while learning

Spotted confusing errors from the "parsing" book

Errata

  • (2023-07-09) The factual errors below are especially unfortunate while I am trying to unravel this whitespace tangle.

If the predefined <ws> is equivalent to the following code,

regex ws { <!ww> \s*}

then it clearly matches zero or more whitespace characters, unless the current position is within a word. The book's two fatal errors are: "at least one" and "unless it's at a word boundary".
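This reading can be checked by calling <.ws> directly in a pattern. The sketch below is not from either book: the test strings are made up for illustration, and it assumes the default <ws> rather than one overridden in a grammar.

```raku
# Default <ws> behaves like: regex ws { <!ww> \s* }

# Zero whitespace is fine when the position is not inside a word.
say so 'a+b' ~~ / a <.ws> '+' <.ws> b /;   # True

# One or many whitespace characters also match.
say so 'a b'    ~~ / a <.ws> b /;          # True
say so "a \t b" ~~ / a <.ws> b /;          # True

# Between two word characters <!ww> fails, so <ws> cannot match at all.
say so 'ab' ~~ / a <.ws> b /;              # False
```

So "zero or more" and "not within a word" are the two properties that matter.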


Even more shocking errors from the "learning" book

Errata

First, <|wb> is wrong according to the official documentation. It should be <?wb> or the hard-to-understand <|w>.
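Both documented spellings can be tried directly. This is a small sketch with made-up input, assuming a current Rakudo; the variable name $s is only for illustration.

```raku
my $s = 'Hello world!';

# <?wb> and <|w> are both zero-width word-boundary assertions.
say so $s ~~ / Hello <?wb> /;   # True: boundary between "o" and the space
say so $s ~~ / Hello <|w>  /;   # True: same assertion, other spelling
say so $s ~~ / Hel   <?wb> /;   # False: "l" followed by "l" is inside a word
say so $s ~~ / Hel   <|w>  /;   # False
```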

Second, the comment "required between word characters, optional otherwise" is plain wrong, or at best confusingly worded. A word character is whatever \w matches: a letter or digit (ASCII or Unicode) or an underscore. If we take "Hello world!", "between word characters" would therefore mean every point where I added a ^:

H^e^l^l^o w^o^r^l^d!

Those points aren't places where whitespace is required. Quite the opposite: each one is inside a word, which is exactly where <!ww> stops <ws> from matching.

What the author presumably means is "between words" or, more verbosely, "between a word character and a non-word character." If we ignore both ends of the string, there are exactly two points where <ws> can occur:

Hello^world^!


As for the confusing writing, I'm seriously appalled, but one instance is forgivable. I will watch out for more.

Setting out to get to the bottom of whitespace

After the horrible findings above, I set out to gain absolute clarity.


Perl 6 Notes

Whitespace (ws)

  • token and rule differ from regex in that they ratchet (no backtracking).
  • rule additionally makes whitespace significant (:sigspace): it inserts <.ws> ONLY where the pattern contains literal whitespace, specifically:
    • After terms and other named things
    • But not at the beginning of the rule #todo/verify
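The three declarators can be compared side by side. This is a minimal sketch with hypothetical names (re, tok, t2, r2, lead) and made-up inputs. The last line is only one way to test the #todo/verify item, so no result is claimed for it.

```raku
# Ratcheting: regex backtracks, token (and rule) do not.
my regex re  {\d+ \d}
my token tok {\d+ \d}
say so '42' ~~ / ^ <re>  $ /;   # True:  \d+ gives back "2" so the final \d can match
say so '42' ~~ / ^ <tok> $ /;   # False: \d+ keeps both digits, nothing is left for \d

# Significant whitespace: only rule turns the space after the first \d into <.ws>.
my token t2 {\d \d}
my rule  r2 {\d \d}
say so '42'  ~~ / ^ <t2> $ /;   # True:  the space in the token is ignored
say so '4 2' ~~ / ^ <t2> $ /;   # False: a token never matches the input space
say so '42'  ~~ / ^ <r2> $ /;   # False: <.ws> fails between two word characters
say so '4 2' ~~ / ^ <r2> $ /;   # True

# Experiment for the todo: if leading whitespace in a rule were turned into
# <.ws>, this anchored match would succeed; if it is ignored, it fails.
my rule lead { \d\d}
say so ' 42' ~~ / ^ <lead> $ /;
```

Anchoring with ^ and $ matters here: without the anchors, an unanchored search can skip leading input and hide the difference.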

An over-the-top but super clarifying example

To understand ws handling completely, here's some actual code (fn1) that confirms what I suspected to be true.

my token tn {<alnum>}  
my token t {\d \d}  
my token t_anchored {^ \d \d $}  
my token t_trailing_backslash_s {\d \d\s}  
my rule r {\d  \d}                        # two spaces  
my rule r_nospace {\d\d}  
my rule r_nospace_from_named {<tn><tn>}  
my rule r_space_from_named {<tn> <tn>}  
my rule r_training_space {\d  \d }        # two spaces 
my rule r_training_backslash_s {\d  \d\s} # two spaces 

Output:

Each of these 16 cases has been confirmed.

「42」 matched token t {\d \d}
「 42」 matched token t {\d \d}
「 42」 does not match token t_anchored {^ \d \d $}
「4 2」 does not match token t {\d \d}
「42」 does not match token t_trailing_backslash_s {\d \d\s}
「42」 does not match rule r {\d  \d}
「42」 matched rule r_nospace {\d\d}
「4 2」 does not match rule r_nospace {\d\d}
「4 2」 does not match rule r_nospace_from_named {<tn><tn>}
「42」 does not match rule r_space_from_named {<tn> <tn>}
「4 2」 matched rule r {\d  \d}
「4  2」 matched rule r {\d  \d}
「4   2」 matched rule r {\d  \d}
「 4   2」 matched rule r {\d  \d}
「 4 2」 matched rule r_training_space {\d  \d }
「 4 2」 does not match rule r_training_backslash_s {\d  \d\s}

fn1: To print token t {\d \d}, I had to refer to the regex object itself with the & sigil and call the .gist method on it, as in &t.gist. I hope I'll learn an easier way to stringify a regex object.
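The harness that produced the listing is not shown above. A minimal sketch of how lines in that shape could be produced follows. The helper name report is hypothetical, and the exact text returned by .gist is taken from the footnote rather than guaranteed here.

```raku
my token t {\d \d}
my rule  r {\d  \d}   # two spaces

# Hypothetical helper: builds one report line for an input and a regex object.
sub report(Str $input, Regex $pattern --> Str) {
    # Smartmatching a string against a Regex object runs an unanchored match.
    my $verdict = $input ~~ $pattern ?? 'matched' !! 'does not match';
    "「$input」 $verdict {$pattern.gist}";
}

# The & sigil passes the regex object itself instead of calling it.
say report('42',  &t);
say report('4 2', &t);
say report('42',  &r);
say report('4 2', &r);
```

Passing &t rather than a pattern literal keeps the declaration and the printed description in one place, which is what the footnote's &t.gist relies on.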