Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

string - Puzzle Solving: Finding All Words Within a Larger Word in PHP

So I have a database of words between 3 and 20 characters long. I want to code something in PHP that finds all of the smaller words that are contained within a larger word. For example, in the word "inward" there are the words "rain", "win", "rid", etc.

At first I thought about adding a field to the Words tables (Words3 through Words20, denoting the number of letters in the words), something like "LetterCount"... for example, "rally" would be represented as 10000000000200000100000010: 1 instances of the letter A, 0 instances of the letter B, ... 2 instances of the letter L, etc. Then, go through all the words in each table (or one table if the target length of found words was specified) and compare the LetterCount of each word to the LetterCount of the source word ("inward" in the example above).

But then I started thinking that that would place too much of a load on the MySQL database as well as the PHP script, calling each and every word's LetterCount, comparing each and every digit to that of the source word, etc.

Is there an easier, perhaps more intuitive way of doing this? I'm open to using stored procedures if it will help with overhead in any way. Just some suggestions would be greatly appreciated. Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here is a simple solution that should be pretty efficient, but will only work up to certain size of words (probably about 15-20 characters it will break down, depending on whether the letters making up the word are low-frequency letters with lower values or high-frequency letters with higher values):

  1. Assign each letter a prime number according to it's frequency. So e is 2, t = 3, a = 5, etc. using frequency values from here or some similar source.
  2. Precalculate the value of each word in your word list by multiplying the prime values for the letters in the word, and store in the table in a bigint data type column. For instance, tea would have a value of 3*2*5=30. If a word has repeated letters, repeat the factor, so that teat should have a value of 3*2*5*3=90.
  3. When checking if a word, such as rain, is contained inside of another word, such as inward, it's sufficient to check if the value for rain divides the value for inward. In this case, inward = 14213045, rain = 7315, and 14213045 is divisible by 7315, so the word rain is inside the word inward.
  4. A bigint column maxes out at 9223372036854775807, which should be fine up to about 15-20 characters (depending on the frequencies of letters in the word). For instance, I picked up the first 20-letter word from here, which is anitinstitutionalism, and has a value of 6901041299724096525 which would just barely fit inside the bigint column. However, the 14-letter word xylopyrography has a value of 635285791503081662905, which is too big. You might have to handle the really large ones as special cases using an alternate method, but hopefully there's few enough of them that it would still be relatively efficient.

The query would work something like the demo I've prepared here: http://www.sqlfiddle.com/#!2/9bd27/8


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...