Fuzzy Searching in PHP: Part 2

squirrel

In part two we looked at how we spider our content and how that content can be stored in way which allows it to be searched. In part 2 I will show you how to actually perform searches on this indexed data.

The basic process can be summarised as follows:

  1. Get a list of keywords to search for (i.e. the search terms to user inputs)
  2. Generate the soundex of each keyword
  3. Select all the rows in the words table (which we can join onto the wordpagelink) which match any of the soundex codes we generated

This will provide us with a very basic list of pages which contain words that match (or sound like) the keywords the user specified.

A More Advanced Search

Although this method does provide us with a fuzzy search it does not give us any degrees of relevancy for each result, so we cannot display the results in any sensible order.

There are several techniques we can use to rank our search results. Firstly, we can use the count field in the wordpagelink to find how many matches are found on each page for each keyword. A large number of matches can increase the relevance.

Another method is to use the levenshtein() function to find the difference between the keywords and the matched words. This is a nice way of combing the speed of the soudex lookup field in our words table with the granularity of the levenshtein difference. Note that the greater the value returned by levenshtein() the less the words match.

It is also possible to alter the weighting of matched keywords based upon where the match occurs in the page content (although this needs to be done during the indexing process). Regular expressions could be used to extract text from headline and title tags. These words are generally more significant and so can be given a higher weighting through the use of the count column; a word found in a h1 or title tag could be given a count of 10, and a word found in a h2 tag could be given a count of 7. You may find it useful to tweak these values for your particular application.

Another interesting method (which I have yet to explore) would be to weight whole pages based upon factors such as traffic, or how many times the page is linked to from other pages in the same site. These options would require more processing during the indexing process, but could ultimately provide very useful results for larger sites.

However, all of these methods need a certain amount of fine-tuning. You will need to play around with the weighing factors until you find you get accurate results, but it should ultimately prove worthwhile. Don’t forgot that that as you increase the number of factors that affect your weightings it will become harder to know exactly which ones to change to get the ideal search results.

2 comments so far

  1. tanin on

    nice but better if show the code

    • Adam Charnock on

      Hi Tanin, thanks for the comment. I kept the code pretty sketchy as my focus was on the general concepts, I hope you found it useful


Leave a reply