Friday 11 April 2008

C# Lucene.NET - Part 2

Part 2 - Revenge of the HTML IFilter

So here's me, sat at my desk, happily indexing away using my nice little indexing server, when the Boss says: "All our HTML page titles are the same, and I don't want the file name shown as the title in the search results. We do use the meta descriptions a lot to detail what each page is about; can we use those?" "Of course we can," I replied.
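
For context, the sort of tag he was talking about is the standard meta description in the page head (a made-up example; the titles were all identical, but the descriptions varied per page):

```html
<head>
  <title>Our Company</title>
  <meta name="description" content="Pricing and availability for the 2008 product range" />
</head>
```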

So, I go back into my indexing program, implement an HTML parser, and add a new field to my Lucene documents when they are created, so that I can save the meta description's "content". I set my indexer off again, and great, no errors. I perform a search, but no results come back for that field. I run it in debug mode, and sure enough, when indexing, my HTML parser couldn't pick up the HTML tags. Looking at the raw body of what the IFilters were producing, the HTML files were being passed into nlhtml.dll for filtering. Having read previously about IFilters, and especially about nlhtml.dll being used predominantly for indexing HTML pages, I was really bemused as to why I couldn't gain access to the HTML tags; why it wouldn't let me get at them.

After two days, it dawned on me: that is exactly what an IFilter is supposed to do! The nlhtml.dll IFilter removes the "unnecessary" formatting from around the HTML page, and only returns what it thinks is necessary, the content from within the tags. Damn it! I need the META tags!

How did I overcome this, you may ask (if you're interested, that is)? Well, I decided I would supply my search engine with a list of file extensions to be treated as Plaintext, i.e. files that need some "other" processing done on them in addition to being passed through the body content filtering process (in hindsight, I wish I'd called them HTML documents rather than Plaintext, as that would have been more correct).
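
That extension list lives in config.xml (as the comment in the code below mentions). The exact element names here are illustrative; SearchConfiguration just needs to deserialise a list of extensions into its PlainTextFiles property:

```xml
<SearchConfiguration>
  <PlainTextFiles>
    <Extension>.htm</Extension>
    <Extension>.html</Extension>
    <Extension>.asp</Extension>
  </PlainTextFiles>
</SearchConfiguration>
```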

So here is what I came up with; maybe not the most elegant, but by god it works! When it finds a file with the extension of a known Plaintext file, it still does the normal content body indexing, but it also does some specific searches for known HTML tags:

public Document LuceneDocument
{
get
{
Document doc = new Document();
doc.Add(Field.Keyword("name", fi.Name));
doc.Add(Field.Keyword("fullname", fi.FullName));
DirectoryInfo di = fi.Directory;
doc.Add(Field.Keyword("directparent", di.FullName));
while (di != null)
{
doc.Add(Field.Keyword("parent", di.FullName));
di = di.Parent;
}
doc.Add(Field.Keyword("created", DateField.DateToString(fi.CreationTime)));
doc.Add(Field.Keyword("modified", DateField.DateToString(fi.LastWriteTime)));
doc.Add(Field.Keyword("accessed", DateField.DateToString(fi.LastAccessTime)));
doc.Add(Field.Keyword("length", fi.Length.ToString()));
doc.Add(Field.UnIndexed("extension", fi.Extension));

//We need to know if this is a plaintextfile search - looking for meta info etc
//mainly used for htm, html, and asp files. This is set in config.xml
ArrayList sPlainTextFiles = new ArrayList(0);
SearchConfiguration cfg = null;
try
{
cfg = SearchConfiguration.Load(Directory.GetParent(Assembly.GetExecutingAssembly().Location) + "/config.xml");
if (cfg == null)
{
Log.Debug("Config file not found.");
}
else
{
sPlainTextFiles = cfg.PlainTextFiles;
}
}
catch (Exception e)
{
Log.Debug("Error loading the config file: " + e);
}
bool bPlainTextRequired = false;
for (int i = 0; i < sPlainTextFiles.Count; i++)
{
if (fi.Extension.ToLower() == sPlainTextFiles[i].ToString())
{
bPlainTextRequired = true;
}
}
string rawText = getBody();
string plainText = "";
string metaContent = "";
string metaDescription = "";
string htmlTitle = "";
string htmlH1 = "";
if (bPlainTextRequired)
{
plainText = getPlainBody();
metaContent = getMetaContent(plainText);
metaDescription = getMetaDescription(plainText);
htmlTitle = getHtmlTitle(plainText);
htmlH1 = getHtmlH1(plainText);
}
//Add the body and the extracted html fields to the document
doc.Add(Field.Text("body", rawText));
doc.Add(Field.Text("metacontent", metaContent));
doc.Add(Field.Text("metadescription", metaDescription));
doc.Add(Field.Text("htmltitle", htmlTitle));
doc.Add(Field.Text("htmlh1", htmlH1));
return doc;
}
}

private string getMetaContent(string sText)
{
string ret = "";
List<HtmlMeta> metaList = MetaParser.Parse(sText);
foreach (HtmlMeta meta in metaList)
{
if ((meta.Name.ToLower() == "title") && (meta.Content.Length > 0))
{
ret = meta.Content;
}
}
return ret;
}
private string getMetaDescription(string sText)
{
string ret = "";
List<HtmlMeta> metaList = MetaParser.Parse(sText);
foreach (HtmlMeta meta in metaList)
{
if ((meta.Name.ToLower() == "description") && (meta.Content.Length > 0))
{
ret = meta.Content;
}
}
return ret;
}
private string getHtmlTitle(string sText)
{
string strOut = Regex.Match(sText, "(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase).Groups[0].Value;
return strOut;
}
private string getHtmlH1(string sText)
{
string strOut = Regex.Match(sText, "(?<=<h1>).*?(?=</h1>)", RegexOptions.IgnoreCase).Groups[0].Value;
Regex rx = new Regex(@"<[^\>]+\>");
return rx.Replace(strOut, "");
}

Note: the lookbehind and lookahead in those regexes grab whatever sits between the opening and closing title and h1 tags, and the extra Regex.Replace in getHtmlH1 strips out any tags nested inside the heading.
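
To make the behaviour of those two helpers concrete, here's a small self-contained sketch. The input page is made up, and it uses the same lookaround patterns directly (with the angle brackets restored) rather than calling the methods above:

```csharp
using System;
using System.Text.RegularExpressions;

class TagDemo
{
    static void Main()
    {
        // Hypothetical input page
        string html = "<html><head><title>Contact Us</title></head>"
                    + "<body><h1>Contact <b>Us</b></h1></body></html>";

        // Same lookaround pattern as getHtmlTitle:
        string title = Regex.Match(html, "(?<=<title>).*?(?=</title>)",
                                   RegexOptions.IgnoreCase).Value;

        // Same pattern as getHtmlH1, then strip any tags nested in the heading:
        string h1 = Regex.Match(html, "(?<=<h1>).*?(?=</h1>)",
                                RegexOptions.IgnoreCase).Value;
        h1 = new Regex(@"<[^\>]+\>").Replace(h1, "");

        Console.WriteLine(title); // Contact Us
        Console.WriteLine(h1);    // Contact Us
    }
}
```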

Did I miss something? Yes: getPlainBody(). Again, not very elegant, but after all, all we need is the plain text, isn't it? So here is getPlainBody():

private string getPlainBody()
{
//If we need plain text, i.e. No IFiltering, then do this.
Log.Echo("Calling default parser for " + fi.FullName);
return PlainTextParser.Extract(fi.FullName);
}

And all the PlainTextParser is, is this:

public class PlainTextParser
{
public PlainTextParser()
{
}

public static string Extract(string path)
{
string strRet = "";
try
{
using (StreamReader sr = new StreamReader(path))
{
strRet = sr.ReadToEnd();
}
//Remove carriage returns... (Replace returns a new string, so assign the result)
strRet = strRet.Replace("\r\n", "");
}
catch (Exception ex)
{
Log.Echo("Plain Text Parser failed - " + ex.Message + ". Invoking default parser.");
strRet = DefaultParser.Extract(path);
}
return strRet;
}
}

Simple huh?

Well, if you have had the same problems as me, I hope this helps.

2 comments:

  1. hi, i am working on a similar kind of stuff, but right now researching a bit on parsing C# source, aspx and html files. from citeknet ifilter explorer it is well evident that for parsing these files you require nlhtml.dll.

    i did a dumpbin of nlhtml only to find four functions.
    are you using the LoadIFilter method of query.dll to do that? it also doesn't support IPersistFile for loading the document.

    can you let me know how we actually load the ifilter.

    thanks ...

  2. Sorry it's taken so long to reply; when I rebuilt my PC, I forgot to add my gmail account to my outlook!

    If you take a look inside the guts of seekafileserver, you'll see that it handles the IFilter for you by interfacing with COM to gain access to the methods.

    It's in the namespace Seekafile.Server.IFilter

    If you wanted to do your own, you can just take a copy of the COM interop from this, and work with the Extract method to do what you need to.

    The thing to remember with IFilters is that choosing which IFilter is required is handled by COM, so the important thing is to have the right IFilters installed; the subsystem does all the dirty work for you.
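
    For anyone rolling their own, the interop I mean looks something like the sketch below. LoadIFilter is the real export from query.dll; the IFilter interface is only stubbed here (in real code you declare its full methods - Init, GetChunk, GetText, GetValue - on the [ComImport] interface), and this only runs on Windows with the filters installed:

```csharp
using System;
using System.Runtime.InteropServices;

class FilterLoader
{
    // COM interface implemented by every IFilter. The GUID is the standard
    // IFilter IID; the method declarations are omitted in this stub.
    [ComImport, Guid("89BCB740-6119-101A-BCB7-00DD010655AF"),
     InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    interface IFilter { /* Init, GetChunk, GetText, GetValue ... */ }

    // LoadIFilter lives in query.dll and picks the registered IFilter for the
    // given file - COM matches the extension to the filter DLL for you.
    [DllImport("query.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern int LoadIFilter(string pwcsPath,
                                  [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
                                  ref IFilter ppIUnk);

    public static IFilter Load(string path)
    {
        IFilter filter = null;
        int hr = LoadIFilter(path, null, ref filter);
        if (hr != 0)
            throw new COMException("LoadIFilter failed for " + path, hr);
        return filter;
    }
}
```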

    ReplyDelete