@DylanLukes

In the process of using Baron for some research on source pulled from hundreds/thousands of repositories on GitHub, I've found that in many cases Baron is unable to tokenize/parse source containing non-ASCII identifiers.

Non-ASCII identifiers have been supported since Python 3.0, as specified by PEP 3131.

This pull request includes some very small changes that allow Baron to handle non-ASCII identifiers:

  • Replace native re module with a dependency on the regex module.
    • This is because regex supports Unicode character property classes.
  • Replace the regex for NAME tokens:
    • Before: [a-zA-Z_]\w*
    • After: [\p{XID_Start}_]\p{XID_Continue}*
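To illustrate the difference between the two patterns, here is a minimal sketch for Python 3. It is not Baron's tokenizer code: the names `OLD_NAME`, `NEW_NAME`, and `is_name` are made up for this example, and the third-party `regex` package is assumed to be installed (the sketch degrades gracefully if it is not).

```python
import re

try:
    import regex  # third-party package (pip install regex); assumed available
except ImportError:
    regex = None

# The NAME pattern before this change, compiled with the stdlib module.
OLD_NAME = re.compile(r"[a-zA-Z_]\w*")

# The NAME pattern after this change; \p{...} classes require `regex`.
NEW_NAME = (
    regex.compile(r"[\p{XID_Start}_]\p{XID_Continue}*") if regex else None
)


def is_name(pattern, text):
    """Return True if the whole string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


for candidate in ("foo", "_private", "β", "가사", "1abc"):
    old = is_name(OLD_NAME, candidate)
    new = is_name(NEW_NAME, candidate) if NEW_NAME else "regex not installed"
    print(repr(candidate), "old:", old, "new:", new)
```

The old pattern rejects `β` and `가사` because its first character class is ASCII-only, even though `\w` itself is Unicode-aware on Python 3.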

I have checked that all tests pass without regression, and have added another simple test:

def test_name_unicode():
    match('β', 'NAME')
    match('가사', 'NAME')

Note:

PEP3131 states:

The identifier syntax is <XID_Start> <XID_Continue>*.

However, this seems to be an error, as XID_Start does not contain _ by default (though the Unicode specifications suggest a Start class could or should contain it).
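The underscore point can be checked with the standard library alone on Python 3. The sketch below is only an illustration: `LETTER_CATEGORIES` is a name made up here, and it lists the general categories for letters and letter numbers, which is an approximation of an identifier-start class rather than the exact XID_Start derivation.

```python
import unicodedata

# Letter and letter-number general categories (illustrative approximation
# of an identifier-start class; not the full XID_Start definition).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

for ch in ("_", "β", "가"):
    category = unicodedata.category(ch)
    print(
        repr(ch),
        "category:", category,
        "letter-like:", category in LETTER_CATEGORIES,
        "valid identifier:", ch.isidentifier(),
    )
```

The underscore is connector punctuation (`Pc`), not a letter, yet Python accepts it as an identifier, which is why the new NAME pattern lists `_` explicitly next to `\p{XID_Start}`.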

@DylanLukes
Author

Looks like there's a failing test on 2.7, will fix.

The 2.6 failure is unrelated to this PR:

$ curl -sSf --retry 5 -o python-2.6.tar.bz2 ${archive_url}
curl: (22) The requested URL returned error: 404 Not Found

@DylanLukes
Author

Alright, tests now all pass on 2.7 and up! I ended up making them conditional on the Python version, as it turns out the derived Unicode categories differ between Python 2 and Python 3.

That is, "α" is matched by "\p{XID_Start}" on Python 3, but not on Python 2.

In summary: this set of changes adds support for Python 3's Unicode identifiers... but only if you're using Python 3.
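For readers wondering what a version-conditional test can look like, here is a rough sketch using only the standard library. It is not necessarily how the tests in this PR are written: `is_name_token` is a hypothetical stand-in for the tokenizer check, implemented here with `str.isidentifier()` purely so the example runs on its own.

```python
import sys
import unittest

PY3 = sys.version_info[0] >= 3


def is_name_token(text):
    # Hypothetical stand-in for the real tokenizer check; it uses the
    # interpreter's own identifier rule so the sketch is self-contained.
    return text.isidentifier()


class TestUnicodeNames(unittest.TestCase):
    @unittest.skipUnless(PY3, "Unicode identifiers are only expected on Python 3")
    def test_name_unicode(self):
        self.assertTrue(is_name_token("β"))
        self.assertTrue(is_name_token("가사"))


if __name__ == "__main__":
    unittest.main()
```

Skipping rather than deleting the test keeps the Python 2 run green while still documenting the expected Python 3 behaviour.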
