Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


.htaccess - RewriteRule cuts off part of a variable name

I have a RewriteRule inside my .htaccess file:

RewriteRule ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ /incl/pages/seo.club.php?state=$1&county=$2&title=$3 [NC,L]

In most cases it works fine; however, if the title starts with the word "club", that word is cut off.

For example, if the slug is fast-cars-club, $_GET['title'] is unchanged, as desired. If the slug is club-of-fast-cars, however, $_GET['title'] contains -of-fast-cars

In the following URL:

mysite.com/tx/travis/club/fast-cars-club

$_GET['title'] == 'fast-cars-club'

But in this URL:

mysite.com/tx/travis/club/club-fast-cars

$_GET['title'] == '-fast-cars'

What am I missing?

question from:https://stackoverflow.com/questions/65644023/rewriterule-cuts-off-part-of-a-variable-name


1 Reply


Your rule is too broad, so it can match the same string in more than one way. The way you hoped it would match isn't necessarily the one the regular expression engine finds first.

First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:

  • ^ start of string
  • [a-zA-Z-/] a lower-case letter, an upper-case letter, a hyphen - or a slash /
  • ([a-zA-Z-/]{2}) the above must match exactly 2 characters, which will be captured as $1
  • / a literal slash, not optional, not captured
  • ([a-zA-Z-/]+) the same set of characters as earlier; this time required to match one or more times (+); captured as $2
  • /club the literal string /club, not optional, not captured
  • /? a literal slash, optional (specifically, ? means must occur zero or one times)
  • [a-zA-Z0-9-] a lower-case letter, an upper-case letter, a digit, or a hyphen -
  • ([a-zA-Z0-9-]+) the above must match one or more times; captured as $3
  • ([a-zA-Z0-9-]+)? the above capture group as a whole is optional
  • /? a literal slash, optional
  • $ end of string

Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):

  • the ^ indicates that we can't throw anything away at the start of the string
  • tx matches ([a-zA-Z-/]{2}) and goes into $1
  • / matches
  • ([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
  • The regex engine now applies "back-tracking": it tries shorter matches until it finds one that matches more of the pattern. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
  • /club is followed by /, so /? matches
  • fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
  • we've used the whole input string, so $ succeeds

Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:

  • the ^ indicates that we can't throw anything away at the start of the string
  • tx matches ([a-zA-Z-/]{2}) and goes into $1
  • / matches
  • ([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
  • While "back-tracking", the regex engine gives characters back from the end of the longest candidate, so it tries travis/club as $2 before it ever tries travis; travis/club is followed by another /club, so the match succeeds
  • there is no following /, but that's fine: /? can match zero occurrences
  • the remainder of the string, -fast-cars matches [a-zA-Z0-9-]+, so is captured into $3
  • we've used the whole input string, so $ succeeds
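
The two walkthroughs above can be reproduced outside Apache. The following is a minimal sketch in Python, assuming that Python's re module backtracks the same way as the regex engine behind mod_rewrite for this particular pattern. re.IGNORECASE stands in for the [NC] flag, and the helper name capture is hypothetical.

```python
import re

# The pattern from the question; re.IGNORECASE stands in for the [NC] flag.
ORIGINAL = re.compile(
    r"^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$",
    re.IGNORECASE,
)

def capture(path):
    """Return the (state, county, title) groups for a path, or None if no match."""
    m = ORIGINAL.match(path)
    return m.groups() if m else None

# The URL that behaves as hoped.
print(capture("tx/travis/club/fast-cars-club"))
# -> ('tx', 'travis', 'fast-cars-club')

# The "misbehaving" URL: the county group swallows "/club" and the title loses it.
print(capture("tx/travis/club/club-fast-cars"))
# -> ('tx', 'travis/club', '-fast-cars')
```

Note that in the second case the county is wrong as well: it comes back as travis/club, which is easy to miss if only the title is inspected.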

This combination of "greediness" and "back-tracking" is key to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex and more specific.

Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:

  • exactly two letters (the state) [a-zA-Z]{2}
  • /
  • one or more letters or hyphens (the county) [a-zA-Z-]+
  • /
  • the literal word club
  • /
  • one or more letters or hyphens (the title) [a-zA-Z-]+
  • /

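Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$ (if titles can contain digits, as your original [a-zA-Z0-9-] allowed, add 0-9 to the last character class).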
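
As a quick check of the all-mandatory pattern, here is a small Python sketch. It assumes Python's re module treats this pattern the same way Apache's engine does; the helper name capture is hypothetical, and the test paths carry a trailing slash because the pattern requires one.

```python
import re

# All-mandatory pattern: no "/" inside the character classes,
# and every separator (including the trailing slash) is required.
STRICT = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$")

def capture(path):
    """Return the (state, county, title) groups for a path, or None if no match."""
    m = STRICT.match(path)
    return m.groups() if m else None

# The title that used to be truncated now survives intact.
print(capture("tx/travis/club/club-fast-cars/"))
# -> ('tx', 'travis', 'club-fast-cars')

print(capture("tx/travis/club/fast-cars-club/"))
# -> ('tx', 'travis', 'fast-cars-club')

# Without the trailing slash the strict pattern does not match at all.
print(capture("tx/travis/club/club-fast-cars"))
# -> None
```

Because the county class can no longer contain a slash, there is only one place the county can end, so the engine has no alternative split to fall back on.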

Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.

We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is quite common to allow both forms but reduce the number of duplicate URLs in search engines).

If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$ Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.

Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
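
The shift in group numbers is easy to get wrong, so here is a hedged Python sketch of both optional variants. The names WHOLE_OPTIONAL, TITLE_OPTIONAL and capture are hypothetical, and Python's re module is only a stand-in for Apache's engine. Python reports an optional group that did not take part in the match as None.

```python
import re

# Variant 1: the whole "/club/<title>" section is optional.
WHOLE_OPTIONAL = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$")

# Variant 2: only "/<title>" is optional; "/club" stays mandatory.
TITLE_OPTIONAL = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$")

def capture(pattern, path):
    """Return all captured groups for a path, or None if the pattern does not match."""
    m = pattern.match(path)
    return m.groups() if m else None

# Group 3 is the wrapper group; the title itself is now group 4.
print(capture(WHOLE_OPTIONAL, "tx/travis/club/club-fast-cars/"))
# -> ('tx', 'travis', '/club/club-fast-cars', 'club-fast-cars')
print(capture(WHOLE_OPTIONAL, "tx/travis/"))
# -> ('tx', 'travis', None, None)

print(capture(TITLE_OPTIONAL, "tx/travis/club/club-fast-cars/"))
# -> ('tx', 'travis', '/club-fast-cars', 'club-fast-cars')
print(capture(TITLE_OPTIONAL, "tx/travis/club/"))
# -> ('tx', 'travis', None, None)
```

In both variants the substitution would use $4 for the title, since the third pair of parentheses only exists to make a section optional.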

The two things in your original pattern that we almost certainly don't want are:

  • Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
  • Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.
