Wednesday, 14 May 2008

FileInfo FullName returns case insensitive results

I had a problem the other day whereby I was trying to record URLs that contained encoded information as a query string parameter, e.g. file.asp?param=4H4tTy7u

The problem was that even though the files were saved with their mixed-case names, when I read them back from the filesystem the names came back as if case insensitive, e.g. file.asp?param=4h4tty7u

This caused major problems: when I then decoded the information in the query string, I was getting completely the wrong information.

Upon tracing, it became apparent that although I could see the mixed-case names in Windows Explorer, the filesystem calls were treating them case insensitively and returning everything in lowercase - which was the cause of the problem.

To get round this, I first found an MS article that said it was caused by a .NET 2.0 Framework bug from 2006 that set a registry value telling the filesystem to behave case insensitively. After I changed this value and restarted, it still did not work.

After a bit of googling, I found the following approach, which eventually solved the problem for me - quite why such an elaborate solution is required is beyond me, but hey! If you have the same problem, at least here is the solution!


//within your code.... (fi is the FileInfo whose name has lost its casing)
string sFileToProcess = fi.FullName;

string dir = Path.GetDirectoryName(sFileToProcess);
dir = ReplaceDirsWithExactCase(dir, Path.GetPathRoot(dir).ToUpper());
string filename = Path.GetFileName(GetExactCaseForFilename(sFileToProcess));
Console.WriteLine(dir + "\\" + filename);



//methods needed for the above to work.
//Walks down from the drive root, replacing each path segment with the
//exactly-cased directory name reported by Directory.GetDirectories.
public static string ReplaceDirsWithExactCase(string fullpath, string parent)
{
    if (fullpath.LastIndexOf(@"\") != fullpath.Length - 1)
        fullpath += @"\";
    if (parent.LastIndexOf(@"\") != parent.Length - 1)
        parent += @"\";

    //The next path segment below 'parent' that needs its casing fixed
    string lookfor = fullpath.ToLower().Replace(parent.ToLower(), "");
    lookfor = (parent + lookfor.Substring(0, lookfor.IndexOf(@"\"))).ToLower();

    string[] dirs = Directory.GetDirectories(parent);
    foreach (string dir in dirs)
    {
        if (dir.ToLower() == lookfor)
        {
            if (lookfor + @"\" == fullpath.ToLower())
                return dir;
            else
                return ReplaceDirsWithExactCase(fullpath, dir);
        }
    }
    return null;
}


//Returns the file name with the exact casing reported by the directory listing.
public static string GetExactCaseForFilename(string file)
{
    string[] files = Directory.GetFiles(Path.GetDirectoryName(file));
    foreach (string f in files)
    {
        if (f.ToLower() == file.ToLower())
            return f;
    }
    return null;
}
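
As an aside (this is not from the article above, just a simpler idea you could try first): for the final path segment alone, you can ask the parent directory for entries matching the name and take whatever casing it reports back. A minimal sketch, assuming the path already exists on disk:

using System;
using System.IO;

public static class ExactCaseHelper
{
    //Returns the last segment of 'path' with the casing reported by the parent
    //directory listing; falls back to the input name if nothing matches.
    public static string GetExactCasedName(string path)
    {
        DirectoryInfo parent = Directory.GetParent(path);
        FileSystemInfo[] matches = parent.GetFileSystemInfos(Path.GetFileName(path));
        return matches.Length > 0 ? matches[0].Name : Path.GetFileName(path);
    }
}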

Thursday, 24 April 2008

Automatically resizing an IFRAME based on its content size

I had a case today where I must have spent 3 or 4 hours trying various elaborate examples to get an IFrame to resize based upon the size of the target page of that IFrame. Many examples suggested having a method in the parent frame that was to be called by the child frame, and in all instances it would just not work for me.

I found the following to work and, fortunately, it required no amendments to the target page, just changes to the page that contained the IFRAME.

Firstly, a little snippet of JavaScript is needed within the HEAD tags, as shown below:

function resizeMe(obj)
{
    //obj is the iframe element itself, passed in via onload="resizeMe(this);"
    var docHeight = obj.contentWindow.document.body.scrollHeight;
    obj.style.height = docHeight + 'px';
}

Note that in IE versions earlier than 5.5 you should use offsetHeight rather than scrollHeight.

Then on the IFRAME itself, add the attribute to trigger the resize when the iframe has loaded, like so:

onload="resizeMe(this);"

You should now have a resizing IFrame based on its content.

Thursday, 17 April 2008

Visual Studio .Net 2005 Hangs on Startup

Well, the subject of this post is what has been happening to me today. For no apparent reason, VS.NET 2005 would just hang at the splash screen. Even after reinstalling, it still would not load.

Here are some suggestions, collated from a morning of googling the forums:

1) Check that your hard drive has at least 1GB of free space
2) Try holding the Shift key down when starting VS, so that any add-ins that may have become corrupt are disabled.
3) Try installing the latest .NET runtime, i.e. at present 3.5
4) As a last resort, try clearing out the user environment settings - WARNING! This will erase all settings you have made in VS, so make a backup if you really need them, and only do this as a last resort. Go to the command prompt and type "devenv.exe /resetuserdata"

If none of those work, you're pretty much stuffed I think! However, number 2 worked for me!

C# Lucene.NET - Sorting

I needed to perform a non-standard search in Lucene - basically a search that allowed me to order the results by something other than relevance. Why? Well, because the Boss wanted me to!

So, first of all, if you've been using Seekafile, you'll need to be aware that "out of the box" it comes with version 1.4 of Lucene. Version 1.4 does not have the overloaded Search method that lets you supply a sort to Lucene.

Firstly then, download the latest version of Lucene, which can be reached by going the long way round: visit the old www.dotlucene.net website and then follow the links to version 2.0 or greater. You will then be able to download a newer version of Lucene.

You will find that the Search() method of the IndexSearcher now has an overload that takes a sort of some description. If you run Search(query, Sort.RELEVANCE), it will perform a search the same as in the old version 1.4 days. However, if you do a search such as Search(query, new Sort("Fullname")), it will sort the results alphabetically on the Fullname field!
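
To make that concrete, here is a minimal sketch of a sorted search. A couple of assumptions: the index path C:\index is made up for the example, and it sorts on a lowercase "fullname" keyword field like the one created by the indexing code elsewhere on this blog (sorting needs an untokenised field like that).

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public class SortedSearchExample
{
    public static void Main()
    {
        IndexSearcher searcher = new IndexSearcher(@"C:\index");
        Query query = new QueryParser("body", new StandardAnalyzer()).Parse("lucene");

        //Same behaviour as the old 1.4-style search - results ordered by relevance
        Hits byRelevance = searcher.Search(query, Sort.RELEVANCE);

        //Alphabetical ordering on the (untokenised) fullname field
        Hits byName = searcher.Search(query, new Sort("fullname"));

        for (int i = 0; i < byName.Length(); i++)
            Console.WriteLine(byName.Doc(i).Get("fullname"));

        searcher.Close();
    }
}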

There are loads more options you can use beyond that - check out the Sort.cs class for the full set.

Friday, 11 April 2008

C# Lucene.NET - Part 3

Part 3 - I need a book, dagnamit!

This is a really short post.

Basically, I needed to know more about Lucene, so I found out that there was a book about it called "Lucene in Action". The book's author really knows his stuff, and if you are new to Lucene it is an absolute must for you to buy it.

The book is well structured and explains Lucene from the ground up, starting with the basics. It does not, however, cover the theory behind search algorithms, proximity, tf-idf or the vector space model, so if you are after that information, use the links I have provided.

C# Lucene.NET - Part 2

Part 2 - Revenge of the HTML IFilter

So here's me, sat at my desk, happily indexing away using my nice little indexing server, when the Boss says: "All our HTML page titles are the same, and I don't want the file name to be shown as the title in the search results, but we do use the meta descriptions a lot to detail what each page is about - can we use that?" "Of course we can," I replied.

So, I go back into my indexing program and implement an HTML parser, and I add a new field to my Lucene document indexes when they are created, wanting to save the meta tag called "content". I set my indexer off again and, great, no errors. I perform a search, but no results come back for that field. I run it in debug mode and, sure enough, when indexing, my HTML parser couldn't pick up the HTML tag. Looking at the raw body of what the IFilters were producing, the HTML files were being passed into nlhtml.dll for filtering. Having read previously about IFilters, and especially about nlhtml.dll being used predominantly for indexing HTML pages, I was really bemused as to why I couldn't gain access to the HTML tags - why it wouldn't let me get at them.

After two days, it dawned on me - that is exactly what an IFilter is supposed to do! The nlhtml.dll IFilter removes the "unnecessary" formatting from around the HTML page and only returns back to me what it thinks is necessary: the content from within the tags - damn it! I need the META tags!

How did I overcome this, you may ask (if you're interested, that is)? Well, I decided that I would supply a list to my search engine of files that I wanted to be considered as Plaintext - i.e. they needed some "other" processing done on them in addition to being passed to the body content filtering process (in hindsight, I wish I'd called them HTML documents rather than Plaintext, as that would be more correct).

So here is what I came up with - maybe not the most elegant, but by god it works! When it finds a file with the extension of a known Plaintext file, it still does the normal content body extraction, but it also does some specific searches for known HTML tags:

public Document LuceneDocument
{
    get
    {
        Document doc = new Document();
        doc.Add(Field.Keyword("name", fi.Name));
        doc.Add(Field.Keyword("fullname", fi.FullName));
        DirectoryInfo di = fi.Directory;
        doc.Add(Field.Keyword("directparent", di.FullName));
        while (di != null)
        {
            doc.Add(Field.Keyword("parent", di.FullName));
            di = di.Parent;
        }
        doc.Add(Field.Keyword("created", DateField.DateToString(fi.CreationTime)));
        doc.Add(Field.Keyword("modified", DateField.DateToString(fi.LastWriteTime)));
        doc.Add(Field.Keyword("accessed", DateField.DateToString(fi.LastAccessTime)));
        doc.Add(Field.Keyword("length", fi.Length.ToString()));
        doc.Add(Field.UnIndexed("extension", fi.Extension));

        //We need to know if this is a plaintext file search - looking for meta info etc
        //mainly used for htm, html, and asp files. This is set in config.xml
        ArrayList sPlainTextFiles = new ArrayList(0);
        SearchConfiguration cfg = null;
        try
        {
            cfg = SearchConfiguration.Load(Directory.GetParent(Assembly.GetExecutingAssembly().Location) + "/config.xml");
            if (cfg == null)
            {
                Log.Debug("Config file not found.");
            }
            else
            {
                sPlainTextFiles = cfg.PlainTextFiles;
            }
        }
        catch (Exception e)
        {
            Log.Debug("Error loading the config file: " + e);
        }

        bool bPlainTextRequired = false;
        for (int i = 0; i < sPlainTextFiles.Count; i++)
        {
            if (fi.Extension.ToLower() == sPlainTextFiles[i].ToString().ToLower())
            {
                bPlainTextRequired = true;
            }
        }

        string rawText = getBody();
        string plainText = "";
        string metaContent = "";
        string metaDescription = "";
        string htmlTitle = "";
        string htmlH1 = "";

        if (bPlainTextRequired)
        {
            //This is one of our "plaintext" (html-ish) files, so pull out the tags we care about
            plainText = getPlainBody();
            metaContent = getMetaContent(plainText);
            metaDescription = getMetaDescription(plainText);
            htmlTitle = getHtmlTitle(plainText);
            htmlH1 = getHtmlH1(plainText);
        }

        //Store the extracted text in the document (the field names here are assumed -
        //match them to whatever your search code queries)
        doc.Add(Field.Text("body", rawText));
        doc.Add(Field.Text("metacontent", metaContent));
        doc.Add(Field.Text("metadescription", metaDescription));
        doc.Add(Field.Text("htmltitle", htmlTitle));
        doc.Add(Field.Text("htmlh1", htmlH1));

        return doc;
    }
}

//The helper methods used above.
private string getMetaContent(string sText)
{
    string ret = "";
    List<HtmlMeta> metaList = MetaParser.Parse(sText);
    foreach (HtmlMeta meta in metaList)
    {
        if ((meta.Name.ToLower() == "title") && (meta.Content.Length > 0))
        {
            ret = meta.Content;
        }
    }
    return ret;
}
private string getMetaDescription(string sText)
{
    string ret = "";
    List<HtmlMeta> metaList = MetaParser.Parse(sText);
    foreach (HtmlMeta meta in metaList)
    {
        if ((meta.Name.ToLower() == "description") && (meta.Content.Length > 0))
        {
            ret = meta.Content;
        }
    }
    return ret;
}
private string getHtmlTitle(string sText)
{
    string strOut = Regex.Match(sText, "(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase).Groups[0].Value;
    return strOut;
}

private string getHtmlH1(string sText)
{
    string strOut = Regex.Match(sText, "(?<=<h1>).*?(?=</h1>)", RegexOptions.IgnoreCase).Groups[0].Value;
    //Strip any markup nested inside the H1
    Regex rx = new Regex(@"<[^\>]+\>");
    return rx.Replace(strOut, "");
}


Did I miss something? Yes - getPlainBody(). Well, again, not very elegant, but after all, all we need is the plain text, isn't it? So here is getPlainBody():

private string getPlainBody()
{
    //If we need plain text, i.e. no IFiltering, then do this.
    Log.Echo("Calling default parser for " + fi.FullName);
    return PlainTextParser.Extract(fi.FullName);
}

And all the PlainTextParser is, is:


public class PlainTextParser
{
    public PlainTextParser()
    {
    }

    public static string Extract(string path)
    {
        string strRet = "";
        try
        {
            StreamReader sr = new StreamReader(path);
            strRet = sr.ReadToEnd();
            //Remove carriage returns (strings are immutable, so reassign the result)
            strRet = strRet.Replace("\r\n", "");
            sr.Close();
        }
        catch (Exception ex)
        {
            Log.Echo("Plain Text Parser failed - " + ex.Message + ". Invoking default parser.");
            strRet = DefaultParser.Extract(path);
        }
        return strRet;
    }
}

Simple huh?

Well, if you have had the same problems as me, I hope this helps.

C# Lucene.NET - Part 1

A few weeks ago I was asked to implement a new website search system, as our current Microsoft Indexing Server was becoming a little unreliable and was frequently not updating documents when they had been saved. After a little searching around, I came across Lucene.

Lucene was originally written in Java, but has since been ported to several other languages including Perl, PHP, C++ and, of course, .NET - however, the support and documentation for it in .NET is dire, to say the least.

Lucene is purely the engine that allows you to index files and then perform searches against that index; everything around it you have to write yourself. Whilst looking for help, I came across another open source project called Seekafile, but again, all support for this has been withdrawn. I persevered with Seekafile though, as it was a useful tool. If you are looking for the basics of Lucene, you are probably going to want to look here, at the most complete .NET documentation for Lucene I was able to find (and prepare to be horrified).
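
To give you a flavour of the "write it all yourself" part, here is a minimal sketch of indexing one document and then searching for it. The index path C:\index is made up for the example, and it uses the old-style Field.Keyword/Field.Text helpers - the same ones my indexing code later on this blog uses:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public class MinimalLuceneExample
{
    public static void Main()
    {
        //Create (or overwrite) a tiny index containing a single document
        IndexWriter writer = new IndexWriter(@"C:\index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.Add(Field.Keyword("name", "example.txt"));
        doc.Add(Field.Text("body", "the quick brown fox jumps over the lazy dog"));
        writer.AddDocument(doc);
        writer.Optimize();
        writer.Close();

        //Search the index for documents whose body mentions "fox"
        IndexSearcher searcher = new IndexSearcher(@"C:\index");
        Query query = new QueryParser("body", new StandardAnalyzer()).Parse("fox");
        Hits hits = searcher.Search(query);
        for (int i = 0; i < hits.Length(); i++)
            Console.WriteLine(hits.Doc(i).Get("name"));
        searcher.Close();
    }
}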

My intention is to use this blog to detail my trials and tribulations with Seekafile and Lucene, so if you are just starting out with it, I really hope that some of this will be of assistance to you!

So, part 1!

What is an IFilter...... and why should I care?

Seekafile is really useful, don't get me wrong, but I had problems from day 1 getting my head around what an IFilter was and why the hell Seekafile needed them. Well, in simple terms, an IFilter is an interface to a pre-written component that is able to read a file of a certain type and return to you the text content, stripped of all the formatting and file nonsense that the host application requires. For example, an MS Word document has plenty of formatting around it that tells Word where, how, and why you have created a pretty looking table in the document. When indexing, and searching for that matter, are you interested in that pretty table? No, you just want to know what is inside it, and that is where the IFilter comes in.

You should know that Indexing Service uses IFilters for everything, so I assume that pretty much all indexing technology does as well.

So, IFilters are great, huh? Yes and no. They have many benefits - Windows natively only supports IFiltering of a handful of key document types (effectively), and if you want to index something more obscure you simply get another IFilter to help you do this. However, the disadvantage is that you have no control over how the IFilter does this, unless you write your own :-(

If you are looking at IFilters, go here first: http://msdn2.microsoft.com/en-us/library/ms692488(VS.85).aspx, where you can find a really useful insight into how Windows decides which IFilter to use for which file type. Obviously, if you follow this, it simply shows you where the IFilter is located, and that then tells you which DLL is going to be used. A more useful tool is from Citeknet, which gives you an abstract view of IFilters: it shows you what you have installed on your system and which will filter which type of file - a useful overview if you ask me. Microsoft, specifically aimed at Indexing Service, provide a little info on IFilters in the articles http://msdn2.microsoft.com/en-us/library/ms692540(VS.85).aspx and http://msdn2.microsoft.com/en-us/library/ms692582(VS.85).aspx, and from these you will see that the good folks at Microsoft help you filter pretty much all of their own stuff. Check out the Citeknet site for some more obscure IFilters.
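
As a rough illustration of the registry lookup that first MSDN article describes, here is a little C# sketch that walks from a file extension to the IFilter DLL. Treat it as a sketch only: the GUID is the documented IFilter interface ID, but some filters register themselves slightly differently, so it won't resolve every file type.

using System;
using Microsoft.Win32;

public class IFilterLookup
{
    //Walks: .ext -> PersistentHandler -> PersistentAddinsRegistered[IID_IFilter]
    //       -> filter CLSID -> InprocServer32 (the DLL that does the filtering)
    public static string FindFilterDll(string extension)
    {
        const string iidIFilter = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

        RegistryKey extKey = Registry.ClassesRoot.OpenSubKey(extension + @"\PersistentHandler");
        if (extKey == null) return null;
        string persistentHandler = (string)extKey.GetValue("");

        RegistryKey addinKey = Registry.ClassesRoot.OpenSubKey(
            @"CLSID\" + persistentHandler + @"\PersistentAddinsRegistered\" + iidIFilter);
        if (addinKey == null) return null;
        string filterClsid = (string)addinKey.GetValue("");

        RegistryKey serverKey = Registry.ClassesRoot.OpenSubKey(
            @"CLSID\" + filterClsid + @"\InprocServer32");
        return serverKey == null ? null : (string)serverKey.GetValue("");
    }

    public static void Main()
    {
        //e.g. ".htm" typically resolves to nlhtml.dll, as seen in Part 2 of this series
        Console.WriteLine(FindFilterDll(".htm"));
    }
}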