Your rule is too broad, so it can match a string in several different ways. The match you were hoping for isn't necessarily the one the regular expression engine will actually find.
First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:
- ^ : start of string
- [a-zA-Z-/] : a lower-case letter, an upper-case letter, a hyphen -, or a slash /
- ([a-zA-Z-/]{2}) : the above must match exactly 2 characters, which will be captured as $1
- / : a literal slash, not optional, not captured
- ([a-zA-Z-/]+) : the same set of characters as earlier, this time required to match one or more times (+); captured as $2
- /club : the literal string /club, not optional, not captured
- /? : a literal slash, optional (specifically, ? means it must occur zero or one times)
- [a-zA-Z0-9-] : a lower-case letter, an upper-case letter, a digit, or a hyphen -
- ([a-zA-Z0-9-]+) : the above must match one or more times; captured as $3
- ([a-zA-Z0-9-]+)? : the above capture group as a whole is optional
- /? : a literal slash, optional
- $ : end of string
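The same breakdown can be written as a commented pattern. This is only a sketch: it uses Python's re module as a stand-in for whatever engine processes your rewrite rules (an assumption on my part), and the name ORIGINAL is just for illustration.

```python
import re

# The pattern from the question, laid out piece by piece.
# re.VERBOSE lets us add whitespace and comments inside the pattern.
ORIGINAL = re.compile(r"""
    ^                   # start of string
    ([a-zA-Z-/]{2})     # $1: exactly two letters, hyphens or slashes
    /                   # literal slash, mandatory
    ([a-zA-Z-/]+)       # $2: one or more letters, hyphens or slashes
    /club               # literal /club, mandatory
    /?                  # optional slash
    ([a-zA-Z0-9-]+)?    # $3: optional run of letters, digits or hyphens
    /?                  # optional trailing slash
    $                   # end of string
""", re.VERBOSE)

print(ORIGINAL.groups)  # 3 capture groups
```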
Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):
- the ^ indicates that we can't throw anything away at the start of the string
- tx matches ([a-zA-Z-/]{2}) and goes into $1
- / matches
- ([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
- The regex engine now applies "back-tracking": it tries progressively shorter matches until the rest of the pattern can match as well. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
- /club is followed by /, so /? matches
- fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
- we've used the whole input string, so $ succeeds
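You can confirm this walk-through by running the pattern yourself. A minimal check, again assuming Python's re module behaves like your rewrite engine for these constructs:

```python
import re

# The original pattern from the question, unchanged.
PATTERN = re.compile(r"^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$")

match = PATTERN.match("tx/travis/club/fast-cars-club")
print(match.groups())  # ('tx', 'travis', 'fast-cars-club')
```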
Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:
- the ^ indicates that we can't throw anything away at the start of the string
- tx matches ([a-zA-Z-/]{2}) and goes into $1
- / matches
- ([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
- While "back-tracking", the regex engine shortens the match from the right, so the first candidate it reaches that is followed by /club is travis/club, not travis. It puts travis/club into $2 and matches the mandatory /club against the start of /club-fast-cars, so it never needs to try anything shorter
- there is no following /, but that's fine: /? can match zero occurrences
- the remainder of the string, -fast-cars, matches [a-zA-Z0-9-]+, so is captured into $3
- we've used the whole input string, so $ succeeds
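To make the back-tracking step concrete, here is a rough sketch. The helper first_backtrack_hit is hypothetical and deliberately simplified: it only models how $2 shrinks from the right until what follows starts with /club, and ignores the rest of the pattern. The real engine's result is printed alongside for comparison (Python's re module is assumed to behave like your rewrite engine here).

```python
import re

PATTERN = re.compile(r"^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$")

def first_backtrack_hit(rest: str) -> str:
    """Simplified model of greedy matching for $2: start with the longest
    prefix and shorten it until what follows begins with '/club'."""
    for end in range(len(rest), 0, -1):
        if rest.startswith("/club", end):
            return rest[:end]
    raise ValueError("no prefix is followed by /club")

# The longest workable prefix wins, so the two URLs split differently.
print(first_backtrack_hit("travis/club/fast-cars-club"))  # travis
print(first_backtrack_hit("travis/club/club-fast-cars"))  # travis/club

# The real engine agrees with the model for the misbehaving URL.
bad = PATTERN.match("tx/travis/club/club-fast-cars")
print(bad.groups())  # ('tx', 'travis/club', '-fast-cars')
```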
Knowing how "greediness" and "back-tracking" work is key to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex and more specific.
Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:
- exactly two letters (the state) [a-zA-Z]{2}
- /
- one or more letters or hyphens (the county) [a-zA-Z-]+
- /
- the literal word club
- /
- one or more letters or hyphens (the title) [a-zA-Z-]+
- /
Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$
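A quick check of the stricter pattern, sketched in Python (the name STRICT is mine, and Python's re module is assumed to stand in for your rewrite engine). The title that previously went wrong is now captured whole, and because everything is mandatory, a URL without the trailing / no longer matches at all:

```python
import re

STRICT = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$")

good = STRICT.match("tx/travis/club/club-fast-cars/")
print(good.groups())  # ('tx', 'travis', 'club-fast-cars')

# No trailing slash: the mandatory final / has nothing to match.
missing_slash = STRICT.match("tx/travis/club/club-fast-cars")
print(missing_slash)  # None
```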
Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.
We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is a common way to allow both forms while reducing the number of duplicate URLs in search engines).
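As a rough illustration of the redirect approach, the sketch below computes the URL to redirect to. The pattern and the helper name add_trailing_slash are assumptions for illustration only; how you actually issue the redirect depends on your platform.

```python
import re

# Matches any non-empty path whose last character is not a slash.
NO_TRAILING_SLASH = re.compile(r"^(.*[^/])$")

def add_trailing_slash(path: str) -> str:
    """Return the path with a trailing slash added if it was missing."""
    return NO_TRAILING_SLASH.sub(r"\1/", path)

print(add_trailing_slash("tx/travis/club/club-fast-cars"))  # tx/travis/club/club-fast-cars/
print(add_trailing_slash("tx/travis/"))                     # tx/travis/
```

With this in place, the main rule only ever needs to handle the form ending in /.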
If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$
Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.
Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
Here too, the extra parentheses mean the title ends up in $4 rather than $3.
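Both optional variants can be checked the same way. In this sketch the names OPTIONAL_CLUB and OPTIONAL_TITLE are mine, and Python reports an unmatched optional group as None (your rewrite engine may give an empty value instead):

```python
import re

# Variant 1: the whole /club/<title> section is optional.
OPTIONAL_CLUB = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$")
# Variant 2: only the /<title> part after /club is optional.
OPTIONAL_TITLE = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$")

full = "tx/travis/club/club-fast-cars/"

print(OPTIONAL_CLUB.match("tx/travis/").groups())
# ('tx', 'travis', None, None)
print(OPTIONAL_CLUB.match(full).groups())
# ('tx', 'travis', '/club/club-fast-cars', 'club-fast-cars')

print(OPTIONAL_TITLE.match("tx/travis/club/").groups())
# ('tx', 'travis', None, None)
print(OPTIONAL_TITLE.match(full).groups())
# ('tx', 'travis', '/club-fast-cars', 'club-fast-cars')
```

In both cases the title is the fourth group, and each pattern still has only one way to match a given URL.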
The two things you had which we almost certainly don't want are:
- Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
- Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.