Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - JavaScript RegExp + word boundaries + Unicode characters

I am building a search feature and I am going to use a JavaScript autocomplete with it. I am from Finland (Finnish language), so I have to deal with some special characters such as ä, å and ö.

When the user types text into the search input field, I try to match the text against the data.

Here is a simple example that does not work correctly if the user types, for example, "ää". The same thing happens with "äl".

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";

// Does not work
//var searchterm = "ää";

// Works
//var searchterm = "wi";

// "\\b" in the string literal becomes the regex token \b (word boundary)
if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}

http://jsfiddle.net/7TsxB/

So how can I get the ä, å and ö characters to work with JavaScript regex?

I think I should use Unicode escapes, but how should I do that? The codes for those characters are: [\u00C4, \u00E4, \u00C5, \u00E5, \u00D6, \u00F6]

=> ÄäÅåÖö


1 Reply

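The problem is how JavaScript regex defines the word boundary \b. It is based on \w, which covers only the ASCII characters [A-Za-z0-9_]. A character such as ä therefore counts as a non-word character, so there is no word boundary between a space (or the beginning of the string) and ä, and \b never matches in front of it.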
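To see the behaviour concretely, here is a small sketch that is not part of the original post. It uses the title string from the question and assumes a plain JavaScript runtime (browser console or Node.js), without jQuery. The variable names are only illustrative.

```javascript
// Sketch: \b is derived from \w, and \w is ASCII-only ([A-Za-z0-9_]).
const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// "w" is a word character, so a boundary exists between " " and "w".
const asciiHit = /\bwi/i.test(title);

// "ä" is not in \w, so there is no boundary between " " and "ä".
const finnishHit = /\bäl/i.test(title);

// \w on its own does not match "ä" either.
const isWordChar = /\w/.test("ä");

// The reverse problem: between "t" and "ä" there IS a boundary,
// so \b matches in the middle of the word "tämä".
const midWordHit = /\bäm/i.test(title);

console.log(asciiHit, finnishHit, isWordChar, midWordHit);
```

The last check suggests that \b is not only too strict for these letters but can also report a boundary in the middle of a Finnish word.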

Instead of using \b, try using (?:^|\s), which is written "(?:^|\\s)" inside a JavaScript string:

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Did not work with \b, works with (?:^|\s)
var searchterm = "äl";

// Did not work with \b, works with (?:^|\s)
//var searchterm = "ää";

// Works
//var searchterm = "wi";

// "\\s" in the string literal becomes the regex token \s (whitespace)
if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}

Breakdown:

(?: parentheses () form a capturing group in regex. Parentheses that start with a question mark and a colon, (?: ), form a non-capturing group. They just group the terms together

^ the caret symbol matches the beginning of a string

| the bar is the "or" operator.

\s matches whitespace (appears as \\s in the string because we have to escape the backslash)

) closes the group

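So instead of using \b, which matches word boundaries but does not treat non-ASCII letters such as ä as word characters, we use a non-capturing group which matches the beginning of the string OR whitespace. Note that the matched text then includes the leading whitespace character, which makes no difference when you only call test().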
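The approach above can be wrapped into a reusable function. The following is only a sketch under some assumptions: the helper names escapeRegExp, matchesWordStart and matchesUnicodeWordStart are hypothetical and do not come from the original post, and the search term is escaped because user input may contain regex metacharacters. The second variant is a different technique from the one in this answer: it uses Unicode property escapes and a lookbehind, which need an ES2018-capable engine, so that a word can also start after punctuation such as "(".

```javascript
// Hypothetical helper: escape regex metacharacters in user input.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Technique from the answer: the term must follow the start of the
// string or a whitespace character.
function matchesWordStart(title, searchterm) {
  const pattern = new RegExp("(?:^|\\s)" + escapeRegExp(searchterm), "i");
  return pattern.test(title);
}

// Alternative (not from the answer, assumes ES2018 support): the term
// must not be preceded by a Unicode letter, a digit or an underscore.
function matchesUnicodeWordStart(title, searchterm) {
  const pattern = new RegExp(
    "(?<![\\p{L}\\p{N}_])" + escapeRegExp(searchterm),
    "iu"
  );
  return pattern.test(title);
}

const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
console.log(matchesWordStart(title, "äl"));               // start of "älkää"
console.log(matchesWordStart(title, "kk"));               // only inside a word
console.log(matchesUnicodeWordStart("sana (älkää)", "äl")); // word after "("
```

The "g" flag is left out here on purpose: with test() it makes the regex remember lastIndex between calls, which is not needed for a yes/no check.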

