Thursday 24 April 2008

Automatically resizing an IFRAME based on its content size

I had a case today where I must have spent 3 or 4 hours trying various elaborate examples to get an IFrame to resize based upon the size of the target page of that IFrame. Many examples suggested having a method in the parent frame that was to be called by the child frame, and in all instances it just would not work for me.

I found the following to work, and fortunately, it required no amendments to the target page, just changes to the page that contains the IFRAME.

Firstly, a little snippet of JavaScript is needed within the HEAD tags, as shown below:

function resizeMe(obj)
{
    // Measure the full height of the framed document, then stretch the IFrame to match.
    var docHeight = obj.contentWindow.document.body.scrollHeight;
    obj.style.height = docHeight + 'px';
}

Note that in versions of IE before 5.5, you should use offsetHeight rather than scrollHeight.

Then in the IFrame, add an onload attribute to trigger the resize when the frame's content has loaded, as such:

onload="resizeMe(this);"

You should now have a resizing IFrame based on its content.

Thursday 17 April 2008

Visual Studio .Net 2005 Hangs on Startup

Well, the subject of this post is what has been happening to me today. For no apparent reason, VS.NET 2005 would just hang at the splash screen. Even after a reinstall, it still would not load.

Here are some suggestions, collated from a morning of googling the forums:

1) Check that your hard drive has at least 1GB of free space.
2) Try holding the Shift key down when starting VS, so that you disable any add-ins that may have become corrupt.
3) Try installing the latest .NET runtime, i.e. at present 3.5.
4) As a last resort, try clearing out the user environment settings - WARNING! This will erase all settings you have made in VS, so make a backup if you really need them, and only do this as a last resort. Go to the command prompt, and type "devenv.exe /resetuserdata"

If none of those work, you're pretty much stuffed I think! However, number 2 worked for me!

C# Lucene.NET - Sorting

I needed to perform a non-standard search in Lucene, basically a search that allowed me to order by something other than relevance - why? Well, because the Boss wanted me to!

So, first of all, if you've been using Seekafile, you need to be aware that "out of the box" it comes with version 1.4 of Lucene. Version 1.4 does not have the overloaded search method that you need in order to supply a sort to Lucene.

Firstly then, download the latest version of Lucene, which can be reached the long way round by visiting the old www.dotlucene.net website and following the links to version 2.0 or greater. You will then be able to download a newer version of Lucene.

You will find that the search() method of the IndexSearcher then expects a sort of some description. If you run search(query, Sort.RELEVANCE), that will perform a search the same as in the old version 1.4 days. However, a search such as search(query, new Sort("Fullname")) will sort the results alphabetically on the fullname field!
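
To put that together, here is a little sketch against Lucene.NET 2.0 - the index path, field names and query below are just placeholders of my own. One thing to be aware of: a field you sort on needs to be indexed untokenized, which Field.Keyword gives you.

using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public class SortedSearchExample
{
    public static void Main()
    {
        //Open a searcher over an existing index (the path is a placeholder).
        IndexSearcher searcher = new IndexSearcher(@"C:\index");
        Query query = new QueryParser("body", new StandardAnalyzer()).Parse("smith");

        //Relevance order - the same behaviour as the old 1.4 search.
        Hits byRelevance = searcher.Search(query, Sort.RELEVANCE);

        //Alphabetical order on the fullname field.
        Hits byName = searcher.Search(query, new Sort("fullname"));

        //Reverse alphabetical order, via a SortField.
        Hits byNameDesc = searcher.Search(query, new Sort(new SortField("fullname", true)));

        searcher.Close();
    }
}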

There are loads more options, you can do other than just that - check out the Sort.cs class for options.

Friday 11 April 2008

C# Lucene.NET - Part 3

Part 3 - I need a book, dagnamit!

This is a really short post.

Basically, I needed to know more about Lucene, so I found out that there was a book about it, called "Lucene in Action". The book's author really knows his stuff, and if you are new to Lucene, it is an absolute must for you to buy it.

The book is well structured and explains Lucene from the ground up, starting with the basics. It does not, however, cover the theories behind search algorithms, proximity, tf-idf or the vector space model, so if you are after that information, use the links I have provided.

C# Lucene.NET - Part 2

Part 2 - Revenge of the HTML IFilter

So here's me, sat at my desk, happily indexing away using my nice little indexing server, when the Boss says: "All our HTML page titles are the same, and I don't want the file name to be shown as the title in the search results, but we do use the meta descriptions a lot to detail what each page is about - can we use those?". "Of course we can," I replied.

So, I go back into my indexing program, implement an HTML parser, and add new fields to my Lucene documents when they are created, in which I want to save the content of the meta tags. I set my indexer off again, and great, no errors. I perform a search, but no results come back for those fields. I run it in debug mode, and sure enough, when indexing, my HTML parser couldn't pick up the HTML tags. Looking at the raw body of what the IFilters were doing, the HTML files were being passed into nlhtml.dll for filtering. Having read previously about IFilters, and especially about nlhtml.dll being used predominantly for indexing HTML pages, I was really bemused as to why I couldn't gain access to the HTML tags - why it wouldn't let me get at them.

After two days, it dawned on me - that is exactly what an IFilter is supposed to do! The nlhtml.dll IFilter removed the "unnecessary" formatting from around the HTML page, and only returned back to me what it thought was necessary: the content from within the tags - damn it! I need the META tags!

How did I overcome this, you may ask (if you're interested, that is)? Well, I decided that I would supply a list to my search engine of files that I wanted to be considered as Plaintext - i.e. they needed some "other" processing done on them in addition to being passed to the body content filtering process (in hindsight, I wish I'd called them HTML documents rather than Plaintext, as that would be more correct).

So here is what I came up with - maybe not the most elegant, but by god it works! When it finds a file with the extension of a known Plaintext file, it still does the normal content body extraction, but it also does some specific searches for known HTML tags:

public Document LuceneDocument
{
    get
    {
        Document doc = new Document();
        doc.Add(Field.Keyword("name", fi.Name));
        doc.Add(Field.Keyword("fullname", fi.FullName));
        DirectoryInfo di = fi.Directory;
        doc.Add(Field.Keyword("directparent", di.FullName));
        while (di != null)
        {
            doc.Add(Field.Keyword("parent", di.FullName));
            di = di.Parent;
        }
        doc.Add(Field.Keyword("created", DateField.DateToString(fi.CreationTime)));
        doc.Add(Field.Keyword("modified", DateField.DateToString(fi.LastWriteTime)));
        doc.Add(Field.Keyword("accessed", DateField.DateToString(fi.LastAccessTime)));
        doc.Add(Field.Keyword("length", fi.Length.ToString()));
        doc.Add(Field.UnIndexed("extension", fi.Extension));

        //We need to know if this is a plaintext file - looking for meta info etc.
        //Mainly used for htm, html, and asp files. This is set in config.xml
        ArrayList sPlainTextFiles = new ArrayList(0);
        SearchConfiguration cfg = null;
        try
        {
            cfg = SearchConfiguration.Load(Directory.GetParent(Assembly.GetExecutingAssembly().Location) + "/config.xml");
            if (cfg == null)
            {
                Log.Debug("Config file not found.");
            }
            else
            {
                sPlainTextFiles = cfg.PlainTextFiles;
            }
        }
        catch (Exception e)
        {
            Log.Debug("Error loading the config file: " + e);
        }

        //Is this file one of our configured Plaintext (i.e. HTML-like) extensions?
        bool bPlainTextRequired = false;
        for (int i = 0; i < sPlainTextFiles.Count; i++)
        {
            if (fi.Extension == (string)sPlainTextFiles[i])
            {
                bPlainTextRequired = true;
            }
        }

        //The normal content body extraction, via the IFilters.
        string rawText = getBody();
        string plainText = "";
        string metaContent = "";
        string metaDescription = "";
        string htmlTitle = "";
        string htmlH1 = "";
        if (bPlainTextRequired)
        {
            //For HTML-like files, also read the raw text, so we can get at the
            //tags that the IFilter strips out.
            plainText = getPlainBody();
            metaContent = getMetaContent(plainText);
            metaDescription = getMetaDescription(plainText);
            htmlTitle = getHtmlTitle(plainText);
            htmlH1 = getHtmlH1(plainText);
        }
        //Add the extracted text to the document so it can be searched
        //(the exact field names here are a best guess at my original code).
        doc.Add(Field.Text("body", rawText));
        doc.Add(Field.Text("metacontent", metaContent));
        doc.Add(Field.Text("metadescription", metaDescription));
        doc.Add(Field.Text("htmltitle", htmlTitle));
        doc.Add(Field.Text("htmlh1", htmlH1));
        return doc;
    }
}

private string getMetaContent(string sText)
{
    string ret = "";
    List<HtmlMeta> metaList = MetaParser.Parse(sText);
    foreach (HtmlMeta meta in metaList)
    {
        if ((meta.Name.ToLower() == "title") && (meta.Content.Length > 0))
        {
            ret = meta.Content;
        }
    }
    return ret;
}

private string getMetaDescription(string sText)
{
    string ret = "";
    List<HtmlMeta> metaList = MetaParser.Parse(sText);
    foreach (HtmlMeta meta in metaList)
    {
        if ((meta.Name.ToLower() == "description") && (meta.Content.Length > 0))
        {
            ret = meta.Content;
        }
    }
    return ret;
}

private string getHtmlTitle(string sText)
{
    string strOut = Regex.Match(sText, "(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase).Groups[0].Value;
    return strOut;
}

private string getHtmlH1(string sText)
{
    string strOut = Regex.Match(sText, "(?<=<h1>).*?(?=</h1>)", RegexOptions.IgnoreCase).Groups[0].Value;
    //Strip any nested tags from inside the H1.
    Regex rx = new Regex(@"<[^\>]+\>");
    return rx.Replace(strOut, "");
}


Did I miss something? Yes - getPlainBody(). Well, again, not very elegant, but after all, all we need is the plain text, isn't it? So here is getPlainBody():

private string getPlainBody()
{
    //If we need plain text, i.e. no IFiltering, then do this.
    Log.Echo("Calling default parser for " + fi.FullName);
    return PlainTextParser.Extract(fi.FullName);
}

And all the PlainTextParser is, is:


public class PlainTextParser
{
    public PlainTextParser()
    {
    }

    public static string Extract(string path)
    {
        string strRet = "";
        try
        {
            StreamReader sr = new StreamReader(path);
            strRet = sr.ReadToEnd();
            //Remove carriage returns (note that Replace returns a new string).
            strRet = strRet.Replace("\r\n", "");
            sr.Close();
        }
        catch (Exception ex)
        {
            Log.Echo("Plain Text Parser failed - " + ex.Message + ". Invoking default parser.");
            strRet = DefaultParser.Extract(path);
        }
        return strRet;
    }
}

Simple huh?

Well, if you have had the same problems as me, I hope this helps.

C# Lucene.NET - Part 1

A few weeks ago I was asked to implement a new website search system, as our current Microsoft Indexing Server was becoming a little unreliable and was frequently not updating documents when they had been saved. After a little searching around, I came across Lucene.

Lucene was originally written in Java, but has since been ported to several other languages including Perl, PHP, C++ and, of course, .NET - however, the support and documentation for it in .NET is dire to say the least.

Lucene is purely the engine that allows you to index files and then perform searches against the index; everything around that you will have to write yourself. Whilst looking for help, I came across another open source project, called Seekafile, but again, all support for this has been dropped. I persevered with Seekafile though, as it was a useful tool. If you are looking for the basics of Lucene, you are probably going to want to look here, at the most complete .NET documentation for Lucene I was able to find (and prepare to be horrified).

My intention is to use this blog to detail my trials and tribulations with Seekafile and Lucene, so if you are just starting out with them, I really hope that some of this will be of assistance to you!

So, part 1!

What is an IFilter... and why should I care?

Seekafile is really useful, don't get me wrong, but I had problems from day one getting my head around what an IFilter was, and why on earth Seekafile needed them. Well, in simple terms, an IFilter is an interface to a pre-written component that is able to read a file of a certain type and return its text content to you, stripped of all the formatting and file nonsense that the host application requires. For example, an MS Word document contains plenty of formatting that tells Word where, how and why you have created, say, a pretty-looking table in the document. When indexing (and searching, for that matter), are you interested in that pretty table? No - you just want to know what is inside it, and that is where the IFilter comes in.

One thing you must know: Indexing Service uses IFilters for everything, so I'm assuming that pretty much all indexing technology does as well.

So, IFilters are great, huh? Yes and no. Indeed they have many benefits: Windows natively only supports filtering of a handful of key document types, but if you want to index something more obscure you can simply get another IFilter to do it for you. However, the disadvantage is that you have no control over how the IFilter does this, unless you write your own.

If you are looking at IFilters, go here first: http://msdn2.microsoft.com/en-us/library/ms692488(VS.85).aspx, where you can find a really useful insight into how Windows decides which IFilter to use for which file type. Obviously, if you follow this, it simply shows you where the IFilter is located, and that then tells you which DLL is going to be used. A more useful tool is from Citeknet, which gives you an abstract view of IFilters: it shows you what you have installed on your system, and which filter will handle which type of file - a useful overview if you ask me. Microsoft do provide a little info on IFilters, specifically aimed at Indexing Service, in the articles http://msdn2.microsoft.com/en-us/library/ms692540(VS.85).aspx and http://msdn2.microsoft.com/en-us/library/ms692582(VS.85).aspx, and from these you will see that the good folks at Microsoft help you filter pretty much all of their own stuff. Check out the Citeknet site for some of the more obscure IFilters.
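
If you fancy following that registry trail programmatically, here is a rough C# sketch of the lookup the first article describes (the helper class below is just mine, for illustration). You go from the file extension, to its persistent handler, to the filter's CLSID, and finally to the DLL; the GUID below is the IID of the IFilter interface itself, which is used as a subkey name along the way.

using System;
using Microsoft.Win32;

public class FilterLookup
{
    //The IID of the IFilter interface, used as a registry subkey name.
    const string IID_IFilter = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    //Walk: extension -> persistent handler -> IFilter CLSID -> the filter DLL.
    public static string FindFilterDll(string extension) //e.g. ".htm"
    {
        string handler = ReadDefault(extension + @"\PersistentHandler");
        if (handler == null) return null;
        string clsid = ReadDefault(@"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IID_IFilter);
        if (clsid == null) return null;
        return ReadDefault(@"CLSID\" + clsid + @"\InprocServer32");
    }

    //Read the default value of a key under HKEY_CLASSES_ROOT.
    static string ReadDefault(string subKey)
    {
        using (RegistryKey key = Registry.ClassesRoot.OpenSubKey(subKey))
        {
            return key == null ? null : key.GetValue(null) as string;
        }
    }

    static void Main()
    {
        //On my machine this points at nlhtml.dll, as per the post above.
        Console.WriteLine(FindFilterDll(".htm") ?? "No IFilter registered");
    }
}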

Thursday 10 April 2008

C# Web Crawling - Knowing when to download a new page

I have been using a C# web crawler written by Hatem Mostafa, which is available on CodeProject, and which I have found extremely useful.

I have now altered it to work with our database of websites, and hey presto, I can crawl all our websites as if I were a user in the outside world.

However, there is a limitation - it is all well and good crawling thousands of pages, but downloading them is costly. So how can you overcome this? Use the HTTP ETag (entity tag) header, an identifier (often a hash) that changes when the file changes. Check the ETag against a record of files already downloaded, and see if it has changed; if it has, proceed to download, as sketched below. Simple really, and it reduces a lot of external traffic on the server as well.
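
Here is a minimal sketch of that check using HttpWebRequest - the URL and stored ETag are placeholders, and it uses the standard If-None-Match request header so that an unchanged page comes back as a 304 Not Modified rather than a full download:

using System;
using System.Net;

public class ETagCheck
{
    //Returns the new ETag if the page was downloaded, or null if it was unchanged.
    public static string FetchIfChanged(string url, string storedETag)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        if (storedETag != null)
        {
            //Ask the server to send the body only if the entity has changed.
            request.Headers[HttpRequestHeader.IfNoneMatch] = storedETag;
        }
        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                //200 OK - the page is new or has changed; read the body here,
                //and record the new ETag against the URL for next time.
                return response.Headers[HttpResponseHeader.ETag];
            }
        }
        catch (WebException ex)
        {
            HttpWebResponse response = ex.Response as HttpWebResponse;
            if (response != null && response.StatusCode == HttpStatusCode.NotModified)
            {
                //304 Not Modified - no need to download again.
                return null;
            }
            throw;
        }
    }
}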

Of course, this presumes that the web server is configured to send ETag headers, and that the pages aren't so dynamic that the ETag becomes irrelevant, or is omitted!