Fuzzy Searching in PHP: Part 2
Friday, March 7. 2008Comments
Display comments as
(Linear | Threaded)
Thanks for this quick but useful series on fuzzy matching.
While order could be implemented solely by the admin, we should consider letting the user do his own weighting, too. A user looking for references may want pages by relevance (weighting), while a user tracking down a friend's recommendation may want pages by popularity.
With an additional column in the wordpagelink table, we could also allow users to restrict their searches to entries or comments.
While order could be implemented solely by the admin, we should consider letting the user do his own weighting, too. A user looking for references may want pages by relevance (weighting), while a user tracking down a friend's recommendation may want pages by popularity.
With an additional column in the wordpagelink table, we could also allow users to restrict their searches to entries or comments.
Hi Judebert,
Welcome to port:eightyeight!
Those are some excellent points. More information stored during the spidering process can translate into more flexibility for the user.
As you said, extra columns in the wikiwordslink table could indicate where the match was found, such as body copy, headings, comments, profiles etc. (would just depend on the application).
Alternatively you could have an ENUM field (in MySQL) in the wikiwordslink table which indicates where the match was found, and a unique key across all three columns. This may end up using less space (on the basis that the majority of words will be body copy).
Anyway, I think comes down to a tradeoff between what your users want and what server load you can handle. As always, a middle-ground could be in order
Thanks for commenting!
Adam
Welcome to port:eightyeight!
Those are some excellent points. More information stored during the spidering process can translate into more flexibility for the user.
As you said, extra columns in the wikiwordslink table could indicate where the match was found, such as body copy, headings, comments, profiles etc. (would just depend on the application).
Alternatively you could have an ENUM field (in MySQL) in the wikiwordslink table which indicates where the match was found, and a unique key across all three columns. This may end up using less space (on the basis that the majority of words will be body copy).
Anyway, I think comes down to a tradeoff between what your users want and what server load you can handle. As always, a middle-ground could be in order
Thanks for commenting!
Adam
This is brilliant. I have seen this method before with MusicBrainz[1], but it doesn't use soundex.
As I understand it, soundex wouldn't match 'run' on 'runner' or 'running'. I don't quite understand how you would use levenshtein, but I think it would help in this case?
What do you recommend?
1. http://wiki.musicbrainz.org/DatabaseSchema/SearchTables
As I understand it, soundex wouldn't match 'run' on 'runner' or 'running'. I don't quite understand how you would use levenshtein, but I think it would help in this case?
What do you recommend?
1. http://wiki.musicbrainz.org/DatabaseSchema/SearchTables
Hi Daniel,
Thanks for reading, and I am please you like it!
I think to answer your question fully I would need to spend a little more time reading up on the inner workings of the soundex algorithm - but unfortunately it is coming up to midnight and I need sleep. However, http://en.wikipedia.org/wiki/Soundex looks like a good starting point.
My initial reaction would be to say that you could do a range match against the soundex field. You could code something like "If the first letter of the soundex matches then accept any values within +/- 60[1] of the generated soundex). You could then apply some sort of weighting based on how 'off' the soundex is. You would probably need split the soundex column in the words table into two columns for this to be efficient (on for the letter and one for the number, with an index across both).
I hope this answers your question. Please let me know if you make any progress with this!
Best,
Adam
[1] run = R552, running = R560, runner = R500.
Thanks for reading, and I am please you like it!
I think to answer your question fully I would need to spend a little more time reading up on the inner workings of the soundex algorithm - but unfortunately it is coming up to midnight and I need sleep. However, http://en.wikipedia.org/wiki/Soundex looks like a good starting point.
My initial reaction would be to say that you could do a range match against the soundex field. You could code something like "If the first letter of the soundex matches then accept any values within +/- 60[1] of the generated soundex). You could then apply some sort of weighting based on how 'off' the soundex is. You would probably need split the soundex column in the words table into two columns for this to be efficient (on for the letter and one for the number, with an index across both).
I hope this answers your question. Please let me know if you make any progress with this!
Best,
Adam
[1] run = R552, running = R560, runner = R500.
Great post! I have been dreading writing a search method for a project I'm working on and it appears that soundex and levenshtein will work perfectly. Thank you for breaking down fuzzy searching into such basic terms.
about.this.blog
This blog is written by Adam Charnock, a freelance PHP developer based in London, England.


Trackbacks