Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
210 views
in Technique[技术] by (71.8m points)

java - Fast algorithm for searching for substrings in a string

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

What I would like to do is:

Given an input string - INSTR:

"BCDEFGH"

And a set of candidate strings - CAND:

"AB", "CDE", "FG", "H", "IJ"

Find any CAND strings that match as substrings within INSTR

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.

I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.

Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.


Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.

Update I only need to know that there was a match - and i don't need to know what the match was.

Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this

In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:

AhoCorsick: 1

RabinKarp: 1.8

Naive Search (check each pattern & use string.contains): 50


*Some resources describing the algos mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html*

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.

If the input is not too large, you don't want to repeat the search many times and you do not have many patterns, it might be a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times.

Implementations:

Presentations:


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...