Friday 11 April 2008

C# Lucene.NET - Part 1

A few weeks ago I was asked to implement a new website search system, as our current Microsoft Indexing Service was becoming a little unreliable and was frequently not picking up changes to documents when they had been saved. After a little searching around, I came across Lucene.

Lucene was originally written in Java, but has since been ported to several other languages including Perl, PHP, C++ and, of course, .NET. However, the support and documentation for it in .NET is dire, to say the least.

Lucene is purely the engine that allows you to index files and then perform searches against that index; everything around it, you will have to write yourself. Whilst looking for help, I came across another open source project called Seekafile but, again, all support for it has been withdrawn. I persevered with Seekafile though, as it was a useful tool. If you are looking for the basics of Lucene, you are probably going to want to look here, at the most .NET documentation for Lucene that I was able to find (and prepare to be horrified).
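To make "index files, then search against the index" a bit more concrete, here is a minimal sketch in Python of the kind of structure a search engine keeps for you. This is not Lucene's API and it is far simpler than anything Lucene really does; the function names `tokenize`, `build_index` and `search`, and the two sample documents, are made up purely for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample documents, keyed by file name.
documents = {
    "report.doc": "Quarterly sales report for the website",
    "notes.txt": "Notes on the new website search system",
}
index = build_index(documents)
print(sorted(search(index, "website search")))
```

The point is only that indexing happens once, up front, and searching is then a lookup against the index rather than a scan of every file. Getting the text out of each file in the first place is the part you are left to solve yourself.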

My intention is to use this blog to detail my trials and tribulations with Seekafile and Lucene, so if you are just starting out with them, I really hope that some of this will be of assistance to you!

So, part 1!

What is an IFilter... and why should I care?

Seekafile is really useful, don't get me wrong, but I had problems from day 1 getting my head around what an IFilter was, and why the hell Seekafile needed them. In simple terms, an IFilter is an interface to a pre-written component that is able to read a file of a certain type and hand back to you the text content, stripped of all the formatting and file nonsense that the host application requires. For example, an MS Word document has plenty of formatting in it that tells Word where, how and why you have created a pretty looking table in the document. When indexing, and searching for that matter, are you interested in that pretty table? No, you just want to know what is inside it, and that is where the IFilter comes in.
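As a rough analogy only, the idea of "formatted file in, plain text out" can be sketched in a few lines of Python. This is not the real IFilter interface (which is a COM interface) and has nothing to do with how the Word filter works; it just strips HTML tags using the standard library, and the names `TextOnlyParser` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(markup):
    """Hypothetical stand-in for a filter: formatted input in, plain text out."""
    parser = TextOnlyParser()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# A "pretty table" whose formatting the indexer does not care about.
sample = "<table border='1'><tr><td><b>Name</b></td><td>Lucene</td></tr></table>"
print(extract_text(sample))  # prints "Name Lucene"
```

The indexer never sees the table, the border or the bold; it only sees the words, which is all it needs.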

You should know that Indexing Service uses IFilters for everything, so I am assuming that pretty much all indexing technology does as well.

So, IFilters are great, huh? Yes and no. They do have benefits: Windows natively only supports filtering of a handful of key document types (effectively), and if you want to index something more obscure you simply get another IFilter to do it for you. The disadvantage is that you have no control over how the IFilter does its job, unless you write your own.

If you are looking at IFilters, go here first: http://msdn2.microsoft.com/en-us/library/ms692488(VS.85).aspx where you can find a really useful insight into how Windows decides which IFilter to use for which file type. If you follow it through, it shows you where the IFilter is registered, and that in turn tells you which DLL is going to be used. A more useful tool is the one from Citeknet, which gives you a higher-level view of IFilters: it shows you what you have installed on your system, and which filter handles which type of file - a useful overview if you ask me. Microsoft do provide a little info on IFilters, aimed specifically at Indexing Service, in the articles http://msdn2.microsoft.com/en-us/library/ms692540(VS.85).aspx and http://msdn2.microsoft.com/en-us/library/ms692582(VS.85).aspx and from these you will see that the good folks at Microsoft help you filter pretty much all their own stuff. Check out the Citeknet site for some more obscure IFilters.
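For a feel of what following that lookup involves, here is a sketch in Python that walks the chain against a fake, in-memory "registry". It assumes the lookup goes: file extension, then document class, then class CLSID, then PersistentHandler, then PersistentAddinsRegistered (keyed by the IFilter interface ID), then InprocServer32, all under HKEY_CLASSES_ROOT. Treat that chain as my reading of the lookup rather than gospel. The `.xyz` extension, the class name, the GUIDs (other than the IFilter interface ID) and the DLL path are all invented placeholders.

```python
# The interface ID of IFilter, used as a key name under PersistentAddinsRegistered.
IID_IFILTER = "{89BCB740-6119-101A-BCB7-00DD010655AF}"


def key(*parts):
    """Join registry path components with backslashes."""
    return "\\".join(parts)


# Invented placeholder GUIDs for the illustration.
DOC_CLSID = "{11111111-1111-1111-1111-111111111111}"
HANDLER = "{22222222-2222-2222-2222-222222222222}"
FILTER_CLSID = "{33333333-3333-3333-3333-333333333333}"

# Hypothetical registry contents: key path -> default value.
FAKE_REGISTRY = {
    key(".xyz"): "Example.Document",
    key("Example.Document", "CLSID"): DOC_CLSID,
    key("CLSID", DOC_CLSID, "PersistentHandler"): HANDLER,
    key("CLSID", HANDLER, "PersistentAddinsRegistered", IID_IFILTER): FILTER_CLSID,
    key("CLSID", FILTER_CLSID, "InprocServer32"): r"C:\example\xyzfilt.dll",
}


def find_filter_dll(extension, registry):
    """Follow the chain from a file extension to the filter DLL, or return None."""
    class_name = registry.get(key(extension))
    if class_name is None:
        return None
    doc_clsid = registry.get(key(class_name, "CLSID"))
    if doc_clsid is None:
        return None
    handler = registry.get(key("CLSID", doc_clsid, "PersistentHandler"))
    if handler is None:
        return None
    filter_clsid = registry.get(
        key("CLSID", handler, "PersistentAddinsRegistered", IID_IFILTER))
    if filter_clsid is None:
        return None
    return registry.get(key("CLSID", filter_clsid, "InprocServer32"))


print(find_filter_dll(".xyz", FAKE_REGISTRY))
```

On a real Windows machine the dictionary lookups would be registry reads (Python's `winreg` module can do that), and any link in the chain may be missing, which is why every step checks for `None`.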
