Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
269 views
in Technique[技术] by (71.8m points)

python - How do I specify a range of unicode characters

How do I specify a range of unicode characters from ' ' (space) to u00D7FF?

I have a regular expression like r'[u0020-u00D7FF]' and it won't compile saying that it's a bad range. I am new to Unicode regular expressions so I haven't had this problem before.

Is there a way to make this compile or a regular expression that I'm forgetting or haven't learned yet?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The syntax of your unicode range will not do what you expect.

  1. The raw r'' string prevents u escapes from being parsed, and the regex engine will not do this. The only range in this set is [0-]:

    >>> re.compile(r'[u0020-u00d7ff]', re.DEBUG)
    in
      literal 117
      literal 48
      literal 48
      literal 50
      range (48, 117)
      literal 48
      literal 48
      literal 100
      literal 55
      literal 102
      literal 102
    
  2. Making it a Unicode literal causes u parsing while leaving other backslashes alone (although that’s not a concern here), but the leading zeroes are messing it up. The syntax is uxxxx or Uxxxxxxxx, so it’s parsed as "u00d7, f, f".

    >>> re.compile(ur'[u0020-u00d7ff]', re.DEBUG)
    in
      range (32, 215)
      literal 102
      literal 102
    
  3. Removing the leading zeroes or switching to U0000d7ff will fix it:

    >>> re.compile(ur'[u0020-ud7ff]', re.DEBUG)
    in
      range (32, 55295)
    

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...