Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

.net - Matching numbers with regular expressions — only digits and commas

I can't figure out how to construct a regex for the example values:

123,456,789
-12,34
1234
-8

Could you help me?


1 Reply


What’s a Number?

I have a simple question for your “simple” question: What precisely do you mean by “a number”?

  • Is −0 a number?
  • How do you feel about √−1?
  • Is ⅝ or ⅔ a number?
  • Is 186,282.42±0.02 miles/second one number — or is it two or three of them?
  • Is 6.02e23 a number?
  • Is 3.141_592_653_589 a number? How about π, or ℯ? And −2π⁻³ ͥ?
  • How many numbers in 0.083̄?
  • How many numbers in 128.0.0.1?
  • What number does ⚄ hold? How about ⚂⚃?
  • Does 10,5 mm have one number in it — or does it have two?
  • Is ∛8³ a number, or is it three of them?
  • What number does ↀↀⅮⅭⅭⅬⅫ AUC represent, 2762 or 2009?
  • Are ४५६७ and ৭৮৯৮ numbers?
  • What about 0377, 0xDEADBEEF, and 0b111101101?
  • Is Inf a number? Is NaN?
  • Is ④② a number? What about ⓰?
  • How do you feel about ㊅?
  • What do ℵ₀ and ℵ₁ have to do with numbers? Or ℝ, ℚ, and ℂ?
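Unicode itself gives several different answers to these questions. As a minimal sketch (in Python, chosen here only for illustration; the patterns in this answer are Perl), the standard unicodedata module shows that a character can carry a numeric value without being a decimal digit. The describe helper is a made-up name for this example.

```python
import unicodedata

def describe(ch):
    """Return (general category, numeric value or None, is it a decimal digit?)."""
    return (
        unicodedata.category(ch),
        unicodedata.numeric(ch, None),
        ch.isdecimal(),
    )

samples = {
    "ASCII seven": "7",
    "circled four": "\u2463",          # ④
    "Roman numeral twelve": "\u216b",  # Ⅻ
    "Greek small pi": "\u03c0",        # π
}

for name, ch in samples.items():
    print(name, describe(ch))
```

Only the first sample is a decimal digit, yet three of the four have a numeric value. Whether the other two count as "numbers" is exactly the decision the asker has to make.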

Suggested Patterns

Also, are you familiar with these patterns? Can you explain the pros and cons of each?

  1. /\D/
  2. /^\d+$/
  3. /^\p{Nd}+$/
  4. /^\pN+$/
  5. /^\p{Numeric_Value:10}$/
  6. /^\P{Numeric_Value:NaN}+$/
  7. /^-?\d+$/
  8. /^[+-]?\d+$/
  9. /^-?\d+\.?\d*$/
  10. /^-?(?:\d+(?:\.\d*)?|\.\d+)$/
  11. /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/
  12. /^((\d)(?(?=(\d))|$)(?(?{ord$3==1+ord$2})(?1)|$))$/
  13. /^(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))$/
  14. /^(?:(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}))$/
  15. /^(?:(?:[+-]?)(?:[0123456789]+))$/
  16. /(([+-]?)([0123456789]{1,3}(?:,?[0123456789]{3})*))/
  17. /^(?:(?:[+-]?)(?:[0123456789]{1,3}(?:,?[0123456789]{3})*))$/
  18. /^(?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))$/
  19. /^(?:(?i)(?:[+-]?)(?:(?=[01]|[.])(?:[01]{1,3}(?:(?:[,])[01]{3})*)(?:(?:[.])(?:[01]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[01]+))|))$/
  20. /^(?:(?i)(?:[+-]?)(?:(?=[0123456789ABCDEF]|[.])(?:[0123456789ABCDEF]{1,3}(?:(?:[,])[0123456789ABCDEF]{3})*)(?:(?:[.])(?:[0123456789ABCDEF]{0,}))?)(?:(?:[G])(?:(?:[+-]?)(?:[0123456789ABCDEF]+))|))$/
  21. /((?i)([+-]?)((?=[0123456789]|[.])([0123456789]{1,3}(?:(?:[_,]?)[0123456789]{3})*)(?:([.])([0123456789]{0,}))?)(?:([E])(([+-]?)([0123456789]+))|))/

I suspect that some of those patterns above may serve your needs. But I cannot tell you which one or ones — or, if none, supply you another — because you haven’t said what you mean by “number”.

As you see, there are a huge number of number possibilities: quite probably ℵ₁ worth of them, in fact.
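To make the choice concrete, here is a rough sketch that runs the four sample values from the question through patterns 15 and 17. It uses Python's re module as a stand-in engine (an assumption for illustration, since the question is about .NET and the patterns come from Perl); both patterns use only syntax common to all three. The matches helper is a made-up name.

```python
import re

# Patterns 15 and 17 from the list above, copied verbatim.
PATTERN_15 = re.compile(r"^(?:(?:[+-]?)(?:[0123456789]+))$")
PATTERN_17 = re.compile(r"^(?:(?:[+-]?)(?:[0123456789]{1,3}(?:,?[0123456789]{3})*))$")

# The example values given in the question.
samples = ["123,456,789", "-12,34", "1234", "-8"]

def matches(pattern, text):
    """True if the anchored pattern accepts the whole string."""
    return pattern.search(text) is not None

for text in samples:
    print(f"{text!r:>14}  15: {matches(PATTERN_15, text)!s:5}  17: {matches(PATTERN_17, text)}")
```

Pattern 17 accepts three of the four values, but neither pattern accepts -12,34. Whether that comma is a decimal mark or a loose grouping separator is something only the asker can say, and the answer changes the pattern.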

Key to Suggested Patterns

Each numbered explanation below describes the pattern with the same number in the list above.

  1. Match if there are any non-digits anywhere in the string, including whitespace like line breaks.
  2. Match only if the string contains nothing but digits, with the possible exception of a trailing line break. Note that a digit is defined as having the property General Category Decimal Number, which is available as \p{Nd}, \p{Decimal_Number}, or \p{General_Category=Decimal_Number}. This in turn is actually just a reflection of those code points whose Numeric Type category is Decimal, which is available as \p{Numeric_Type=Decimal}.
  3. This is the same as 2 in most regex languages. Java is an exception here, because it does not map the simple charclass escapes like \w and \W, \d and \D, \s and \S, and \b or \B into the appropriate Unicode property. That means you must not use any of those eight one-character escapes for any Unicode data in Java, because they work only on ASCII even though Java always uses Unicode characters internally.
  4. This is slightly different from 3 in that it isn’t limited to decimal numbers, but can be any number at all; that is, any character with the \pN, \p{Number}, or \p{General_Category=Number} property. These include \p{Nl} or \p{Letter_Number} for things like Roman numerals and \p{No} or \p{Other_Number} for subscripted and superscripted numbers, fractions, and circled numbers, amongst others, like counting rods.
  5. This matches only those strings composed entirely of numbers whose decimal value is 10, so things like Ⅹ the Roman numeral ten, and ⑩, ⑽, ⒑, ⓾, ❿, ➉, and ➓.
  6. Only those strings that contain characters that lack the Numeric Value NaN; in other words, all chars must have some numeric value.
  7. Matches only Decimal Numbers, optionally with a leading HYPHEN MINUS.
  8. Same as 7 but now also works if the sign is plus instead of minus.
  9. Looks for decimal numbers, with optional HYPHEN MINUS and optional FULL STOP plus zero or more decimal numbers following.
  10. Same as 9 but doesn't require digits before the dot if it has some afterwards.
  11. Standard floating-point notation per C and many other languages, allowing for scientific notation.
  12. Finds numbers composed only of two or more decimals of any script in ascending order, like 789 or 12345. This recursive regex includes a callout to Perl code that checks whether the lookahead digit has a code point value that is the successor of the current digit; that is, its ordinal value is one greater. One could do this in PCRE using a C function as the callout.
  13. This looks for a valid IPv4 address with four decimal numbers in the valid range, like 128.0.0.1 or 255.255.255.240, but not 999.999.999.999.
  14. This looks for a valid MAC address, so six colon-separated pairs of two ASCII hex digits.
  15. This looks for whole numbers in the ASCII range with an optional leading sign. This is the normal pattern for matching ASCII integers.
  16. This is like 15, except that it requires a comma to separate groups of three.
  17. This is like 15, except that the comma for separating groups is now optional.
  18. This is the normal pattern for matching C-style floating-point numbers in ASCII.
  19. This is like 18, but requiring a comma to separate groups of 3 and in base-2 instead of in base-10.
  20. This is like 19, but in hex. Note that the optional exponent is now indicated by a G instead of an E, since E is a valid hex digit.
  21. This checks that the string contains a C-style floating-point number, but with an optional grouping separator every three digits of either a comma or an underscore (LOW LINE) between them. It also stores that string into the \1 capture group, making it available as $1 after the match succeeds.
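The difference between items 2, 3, and 15 in the key depends on the engine and its flags. As a small sketch, assuming Python 3's re module rather than the Perl used in this answer: for str patterns \d matches any Unicode decimal digit, and the re.ASCII flag narrows it to 0 through 9. The is_match helper and the sample strings are made up for this example.

```python
import re

UNICODE_DIGITS = re.compile(r"^\d+$")          # str pattern: \d is Unicode-aware
ASCII_DIGITS = re.compile(r"^\d+$", re.ASCII)  # \d restricted to [0-9]

devanagari = "\u096a\u096b\u096c\u096d"  # DEVANAGARI DIGITS FOUR through SEVEN
roman = "\u216b"                         # ROMAN NUMERAL TWELVE: a Letter_Number, not a decimal digit

def is_match(pattern, text):
    """True if the anchored pattern accepts the string."""
    return pattern.search(text) is not None

for label, text in [("ascii", "1234"), ("devanagari", devanagari), ("roman", roman)]:
    print(label, is_match(UNICODE_DIGITS, text), is_match(ASCII_DIGITS, text))
```

The Devanagari string passes the Unicode-aware pattern and fails the ASCII one, while the Roman numeral fails both because it is not a decimal digit at all.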

Sources and Maintainability

Patterns 1, 2, and 7–11 come from a previous incarnation of the Perl Frequently Asked Questions list, in the question “How do I validate input?”. That section has since been replaced by a suggestion to use the Regexp::Common module, written by Abigail and Damian Conway. The original patterns can still be found in Recipe 2.1 of the Perl Cookbook, “Checking Whether a String Is a Valid Number”, solutions to which can be found for a dizzying number of diverse languages, including ada, common lisp, groovy, guile, haskell, java, merd, ocaml, php, pike, python, rexx, ruby, and tcl, at the PLEAC project.

Pattern 12 could be more legibly rewritten

m{
    ^
    (
        ( \d )
        (?(?= ( \d ) ) | $ )
        (?(?{ ord $3 == 1 + ord $2 }) (?1) | $ )
    )
    $
}x

It uses regex recursion, which is found in many pattern engines, including Perl and all the PCRE-derived languages. But it also uses an embedded code callout as the test of its second conditional pattern; to my knowledge, code callouts are available only in Perl and PCRE.
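In an engine with neither recursion nor code callouts (Python's built-in re module is one), the successor test has to live outside the pattern. The following is a rough sketch of that check as ordinary code. It is not a translation of the regex, and the function name is made up for this example.

```python
def is_consecutive_digit_run(text):
    """True if text consists only of decimal digits (of any script) and each
    digit's code point is exactly one greater than the previous digit's."""
    if not text or not all(ch.isdecimal() for ch in text):
        return False
    # Compare every adjacent pair: the next code point must be the successor.
    return all(ord(b) == ord(a) + 1 for a, b in zip(text, text[1:]))

for candidate in ["456", "654", "4568", "\u096a\u096b\u096c"]:
    print(repr(candidate), is_consecutive_digit_run(candidate))
```

This sketch accepts a single digit, since there is no adjacent pair to fail the test. Add a length check if two or more digits are required.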

Patterns 13–21 were derived from the aforementioned Regexp::Common module. Note that for brevity, these are all written without the whitespace and comments that you would definitely want in production code. Here is how that might look in /x mode:

$real_rx = qr{ (   # start $1 to hold entire pattern
    ( [+-]? )                  # optional leading sign, captured into $2
    (                          # start $3
        (?=                    # look ahead for what next char *will* be
            [0123456789]       #    EITHER:  an ASCII digit
          | [.]                #    OR ELSE: a dot
        )                      # end look ahead
        (                      # start $4
           [0123456789]{1,3}       # 1-3 ASCII digits to start the number
           (?:                     # then optionally followed by
               (?: [_,]? )         # an optional grouping separator of comma or underscore
               [0123456789]{3}     # followed by exactly three ASCII digits
           ) *                     # repeated any number of times
        )                          # end $4
        (?:                        # begin optional cluster
             ( [.] )               # required literal dot in $5
             ( [0123456789]{0,} )  # then optional ASCII digits in $6
        ) ?                        # end optional cluster
     )                         # end $3
    (?:                        # begin cluster group
        ( [E] )                #   base-10 exponent into $7
        (                      #   exponent number into $8
            ( [+-] ? )         #     optional sign for exponent into $9
            ( [0123456789] + ) #     one or more ASCII digits into $10
        )                      #   end $8
      |                        #   or else nothing at all
    )                          # end cluster group
) }xi;          # end $1 and whole pattern, enabling /x and /i modes
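For readers who want to try the capture groups outside Perl, here is a hedged port to Python's re module, where re.VERBOSE and re.IGNORECASE stand in for /x and /i. It keeps the same character classes and group numbering as the Perl version. The name REAL_RX and the sample string are assumptions made for this example.

```python
import re

REAL_RX = re.compile(r"""
(                                   # group 1: the entire number
    ( [+-]? )                       # group 2: optional leading sign
    (                               # group 3: digits plus optional fraction
        (?= [0123456789] | [.] )    # next char must be a digit or a dot
        (                           # group 4: digits with optional separators
            [0123456789]{1,3}
            (?: [_,]? [0123456789]{3} )*
        )
        (?:
            ( [.] )                 # group 5: literal dot
            ( [0123456789]* )       # group 6: fraction digits, possibly none
        )?
    )
    (?:
        ( [E] )                     # group 7: exponent marker, either case
        (                           # group 8: signed exponent
            ( [+-]? )               # group 9: optional exponent sign
            ( [0123456789]+ )       # group 10: exponent digits
        )
      |                             # or no exponent at all
    )
)
""", re.VERBOSE | re.IGNORECASE)

match = REAL_RX.search("mass = -1_234,567.89e+10 kg")
print(match.groups())
```

Here match.group(1) through match.group(10) correspond to $1 through $10 in the Perl version. Because the pattern is unanchored, search picks the number out of the surrounding text.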

From a software engineering perspective, there are still several issues with the style used in the /x mode version immediately above. First, there is a great deal of code repetition, where you see the same [0123456789]; what happens if one of those sequences accidentally leaves a di

