For a while now I have been working on a developing a wiki application, both for personal projects and for use by clients. As part of this I needed to implement a search feature, allowing users to search the content on the wiki. At this point there were a few options open to me. All the content was stored in a MySQL database, so I could simply use a ‘LIKE’ statement against all the stored content, but there are several problems with this method:
- It is not very efficient, although this could be improved by using MySQL’s full text indexing (which requires the MyISAM table type). I also wanted to keep the application RDBMS independent if possible.
- There would be very little control over content matching. For example, HTML tags and attributes would be matched, which may not be desired in your application
- It would not be possible to perform context sensitive weighting. For example, a match found inside a <ht> tag should be more significant than a match found in the body text
In the end I realised I needed to develop my own searching system. My immediate thought was just to pull all the content from the database and run through it all using something like soundex() or levenshtein(). Although simple, this had the obvious problem of having to fetch every page from the database. Take the Ubuntu Wiki for example, with nearly 27,000 pages (at the time of writing), and one can conservatively estimate that each page is around 1KiB in size. So that means you need to transfer 27MiB of data from the database and process per search. Ouch!
The Solution
So now we can see that a brute force method won't work, lets try something a little smarter. If you haven't done so already, have a look at the soudex() function. As you can see, the function takes a word and generates a four character code that represents how it is pronounced in English (although it may work acceptably for other western languages). Now if we spider each page's content when it is changed (or created), and apply this to each word we find on that page, we can populate a 'words' table:
It is worth noting that we have a created an index on the soundex column - this will greatly speed things up later.
We link the words table to our wiki pages using another table:
At this point I should say that this method is different to directly searching the content using ‘LIKE’ because we are actually spidering the content to store it in a more search-friendly form. This makes searching much more efficient, but does require a little extra storage (for the additional tables).
So when we spider a page we strip out all the HTML tags and anything else we don't want (such as very common words). We insert each of the remaining words into the words table (as long as they are not already in there), generating the value for the soundex column by using the soundex() PHP function. After this is done we need to find the wordid of the word we just inserted (or if the word already exists, then the ID of that word) and create a corresponding row in the wordwikipagelink table as follows:
- Take the ID of the word we just added (or if it already existed, then the ID of that word in the table.)
- If this word is already listed in the wordwikipagelink for the current page ID then just increment the count column.
- If this word is not listed for the current page ID then add a new row and set the count to one.
Using this method we can maintain an index of all the content in our wiki (or blog, or CMS. The importing point is that you understand how the concept works)
But my content isn't in a database!
This is not a problem, if your content is in static files you can simply replace the wikipageid column in the wordwikipagelink table with somthing that you use to uniquely identify your files, such as the path and file name. Then, instead of bringing content from the database, you can just use file_get_contents() on the file and then proceed as normal. You can also extend this method to index content on remote sites (for whatever reason) by using file_get_contents() on a URL. Or maybe more usefully, you can use this method to index content on your own site that you may not be able to easily extract from it's storage medium, or is highly dynamic (in which case, make sure you reindex frequently).
Part 2 will cover how to actually perform the search, with some help from levenshtein(). If you cannot wait, checkout this spiderPage() method, as well the SeachModel class. I developed both of these for the Wiki package.
Trackbacks