Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.9k views
in Technique[技术] by (71.8m points)

algorithm - MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard

I recently implemented the UDFs of the Damerau–Levenshtein algorithms into MySQL, and was wondering if there is a way to combine the fuzzy matching of the Damerau–Levenshtein algorithm with the wildcard searching of the Like function? If I have the following data in a table:

ID | Text
---------------------------------------------
1  | let's find this document
2  | let's find this docment
3  | When the book is closed
4  | The dcument is locked

I want to run a query that would incorporate the Damerau–Levenshtein algorithm...

select text from table where damlev('Document',tablename.text) <= 5;

...with a wildcard match to return IDs 1, 2, and 4 in my query. I'm not sure of the syntax or if this is possible, or whether I would have to approach this differently. The above select statement works fine in issolation, but is not working on individual words. I would have to change the above SQL to...

select text from table where 
 damlev('let's find this document',tablename.text) <= 5;

...which of course returns just ID 2. I'm hoping there is a way to combine the fuzzy and wildcard together if I want all records returned that have the word "document" or variations of it appearing anyway within the Text field.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In working with person names, and doing fuzzy lookups on them, what worked for me was to create a second table of words. Also create a third table that is an intersect table for the many to many relationship between the table containing the text, and the word table. When a row is added to the text table, you split the text into words and populate the intersect table appropriately, adding new words to the word table when needed. Once this structure is in place, you can do lookups a bit faster, because you only need to perform your damlev function over the table of unique words. A simple join gets you the text containing the matching words. enter image description here

A query for a single word match would look something like this:

SELECT T.* FROM Words AS W
JOIN Intersect AS I ON I.WordId = W.WordId
JOIN Text AS T ON T.TextId = I.TextId
WHERE damlev('document',W.Word) <= 5 

and two words would look like this (off the top of my head, so may not be exactly correct):

SELECT T.* FROM Text AS T
JOIN (SELECT I.TextId, COUNT(I.WordId) AS MatchCount FROM Word AS W
      JOIN Intersect AS I ON I.WordId = W.WordId
      WHERE damlev('john',W.Word) <= 2
            OR damlev('smith',W.Word) <=2
      GROUP BY I.TextId) AS Matches ON Matches.TextId = T.TextId
          AND Matches.MatchCount = 2

The advantages here, at the cost of some database space, is that you only have to apply the time-expensive damlev function to the unique words, which will probably only number in the 10's of thousands regardless of the size of your table of text. This matters, because the damlev UDF will not use indexes - it will scan the entire table on which it's applied to compute a value for every row. Scanning just the unique words should be much faster. The other advantage is that the damlev is applied at the word level, which seems to be what you are asking for. Another advantage is that you can expand the query to support searching on multiple words, and can rank the results by grouping the matching intersect rows on TextId, and ranking on the count of matches.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...